Abstract

Analysis of non-asymptotic estimation error and structured statistical recovery based on norm regularized
regression, such as Lasso, needs to consider four aspects: the norm, the loss function, the design
matrix, and the noise model. This paper presents generalizations of such estimation error analysis on all
four aspects compared to the existing literature. We characterize the restricted error set where the
estimation error vector lies, establish relations between error sets for the constrained and regularized problems,
and present an estimation error bound applicable to any norm. Precise characterizations of the bound are
presented for isotropic as well as anisotropic sub-Gaussian design matrices, sub-Gaussian noise models,
and convex loss functions, including least squares and generalized linear models. Generic chaining and
associated results play an important role in the analysis. A key result from the analysis is that the sample
complexity of all such estimators depends on the Gaussian width of a spherical cap corresponding to the
restricted error set. Further, once the number of samples $n$ crosses the required sample complexity, the
estimation error decreases as $c/\sqrt{n}$, where $c$ depends on the Gaussian width of the unit norm ball.


1 Introduction 

Over the past decade, progress has been made in developing non-asymptotic bounds on the estimation error
of structured parameters based on norm regularized regression. Such estimators are usually of the form

$$\hat{\theta}_{\lambda_n} = \underset{\theta \in \mathbb{R}^p}{\operatorname{argmin}} \; \mathcal{L}(\theta; Z^n) + \lambda_n R(\theta) , \tag{1}$$

where $R(\theta)$ is a suitable norm, $\mathcal{L}(\cdot)$ is a suitable loss function, $Z^n = \{(y_i, X_i)\}_{i=1}^n$ with $y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^p$
is the training set, and $\lambda_n > 0$ is a regularization parameter. The optimal parameter $\theta^*$ is often assumed to be
'structured', usually characterized or approximated as having a small value according to some norm $R(\cdot)$. Recent
work has viewed such characterizations in terms of atomic norms, which give the tightest convex relaxation
of a structured set of atoms to which $\theta^*$ belongs. Since $\hat{\theta}_{\lambda_n}$ is an estimate of the optimal structure $\theta^*$,
the focus has been on bounding a suitable measure of the error vector $\hat{\Delta}_n = (\hat{\theta}_{\lambda_n} - \theta^*)$, e.g., the $L_2$ norm
$\|\hat{\Delta}_n\|_2$.

To understand the state of the art on non-asymptotic bounds on the estimation error for norm-regularized
regression, four aspects of (1) need to be considered: (i) the norm $R(\theta)$, (ii) properties of the design matrix
$X = [X_1 \cdots X_n]^T \in \mathbb{R}^{n \times p}$, (iii) the loss function $\mathcal{L}(\cdot)$, and (iv) the noise model, typically in terms of
$\omega_i = y_i - E[y \mid X_i]$. Most of the literature has focused on a linear model, $y = X\theta + \omega$, and a squared-loss
function, $\mathcal{L}(\theta; Z^n) = \frac{1}{2n}\|y - X\theta\|_2^2 = \frac{1}{2n}\sum_{i=1}^n (y_i - \langle \theta, X_i \rangle)^2$. Early work on such estimators focused on
the $L_1$ norm, and led to sufficient conditions on the design matrix $X$, including the restricted
isometry property (RIP) and restricted eigenvalue (RE) conditions [6, 23, 33]. While much of the
development has focused on isotropic Gaussian design matrices, recent work has extended the analysis for the
$L_1$ norm to correlated Gaussian designs [33] as well as anisotropic sub-Gaussian design matrices [34].
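
To make the estimator in (1) concrete, the following is a minimal sketch for the Lasso special case, i.e., squared loss with the $L_1$ norm. It uses proximal gradient descent (ISTA), which is one standard solver and not a method taken from this paper. The function names, the step-size rule, the synthetic data, and the value of the regularization parameter are all illustrative assumptions.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||y - X theta||_2^2 + lam * ||theta||_1 by proximal gradient."""
    n, p = len(X), len(X[0])
    # Step size 1/L, where L = ||X||_F^2 / n upper-bounds the Lipschitz
    # constant of the gradient of the squared loss (assumed choice).
    L = sum(x * x for row in X for x in row) / n
    theta = [0.0] * p
    for _ in range(n_iter):
        resid = [sum(X[i][j] * theta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        theta = [soft_threshold(theta[j] - grad[j] / L, lam / L) for j in range(p)]
    return theta


def demo(seed=0, n=60, p=8):
    """Synthetic example: sparse theta_star, isotropic Gaussian design, small noise."""
    rng = random.Random(seed)
    theta_star = [2.0, -1.5] + [0.0] * (p - 2)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(X[i][j] * theta_star[j] for j in range(p)) + 0.1 * rng.gauss(0, 1)
         for i in range(n)]
    theta_hat = lasso_ista(X, y, lam=0.05)
    # L2 norm of the error vector theta_hat - theta_star
    err = sum((a - b) ** 2 for a, b in zip(theta_hat, theta_star)) ** 0.5
    return theta_hat, err
```

Calling `demo()` returns the estimate together with the $L_2$ norm of the error vector, which is the quantity the bounds in this paper control.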

Building on such development, [29] presents a unified framework for the case of decomposable norms and
also considers generalized linear models (GLMs) for certain norms such as $L_1$. Two key insights are offered
in [29]: first, the error vector lies in a restricted set, a cone or a star, for suitably large $\lambda_n$, and second,
the loss function needs to satisfy restricted strong convexity (RSC), a generalization of the RE condition, on
the restricted error set for the analysis to work out.

For isotropic Gaussian design matrices, additional progress has been made. [14] considers a constrained
estimation formulation for all atomic norms, where the gain condition, equivalent to the RE condition, uses
Gordon's inequality [19, 20] and is succinctly represented in terms of the Gaussian width of the
intersection of the cone of the error set and a unit ball/sphere. A separate line of work considers three related formulations for
generalized Lasso problems and establishes recovery guarantees based on Gordon's inequality and quantities
related to the Gaussian width. Sharper analysis for recovery has been considered in [1], yielding a precise
characterization of phase transition behavior using quantities related to the Gaussian width. Other work considers a
linear programming estimator in a 1-bit compressed sensing setting and, interestingly, the concept of Gaussian
width shows up in the analysis. In spite of the advances, with a few notable exceptions [40, 42], most
existing results are restricted to isotropic Gaussian design matrices. Further, while a suitable scale for $\lambda_n$ is
known for special cases such as the $L_1$ norm, a general analysis applicable to any norm $R(\cdot)$ has not been explored
in the literature.

In this paper, we consider structured estimation problems with norm regularization of the form (1), and
present a unified analysis which substantially generalizes existing results on all four pertinent aspects: the
norm, the design matrix, the loss, and the noise model. The analysis we present applies to all norms, and the
results can be divided into three groups: characterization of the error set and recovery guarantees,
characterization of the regularization parameter, and characterization of the restricted eigenvalue conditions or
restricted strong convexity. We provide a summary of the key results below.

Restricted error set: We start with a characterization of the error set $E_r$ to which the error vector $\hat{\Delta}_n$
belongs. For a suitably large $\lambda_n$, we show that $\hat{\Delta}_n$ belongs to the restricted error set

$$E_r = \left\{ \Delta \in \mathbb{R}^p \;\middle|\; R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta) \right\} , \tag{2}$$

where $\beta > 1$ is a constant. The restricted error set has interesting structure, and forms the basis of subsequent
analysis for bounds on $\|\hat{\Delta}_n\|_2$.

Regularized vs. constrained estimators: As an alternative to regularized estimators, the literature has
considered constrained estimators which directly focus on minimizing $R(\theta)$ under suitable constraints
determined by the noise $(y - X\theta)$ and/or the design matrix $X$ [13, 6, 14, 15]. A recent example of such a
constrained estimator is the generalized Dantzig selector (GDS) [15], which generalizes the Dantzig
selector corresponding to the $L_1$ norm, and is given by:

$$\hat{\theta}_{\gamma_n} = \underset{\theta \in \mathbb{R}^p}{\operatorname{argmin}} \; R(\theta) \quad \text{s.t.} \quad R^*(X^T(y - X\theta)) \le \gamma_n , \tag{3}$$

where $R^*(\cdot)$ denotes the dual norm of $R(\cdot)$. One can show [14, 15] that the restricted error set for such
constrained estimators is of the form:

$$E_c = \{ \Delta \in \mathbb{R}^p \mid R(\theta^* + \Delta) \le R(\theta^*) \} . \tag{4}$$

One can readily see that $E_r$ is larger than $E_c$, i.e., $E_c \subseteq E_r$, and $E_r$ approaches $E_c$ as $\beta$ increases. We
establish a geometric relationship between the two sets, which may help in transferring analysis
done on regularized estimators as in (1) to corresponding constrained estimators as in (3), and vice versa. Let
$B_2^p$ be the $L_2$ ball of radius 1 in $\mathbb{R}^p$. Then, with $A_r = E_r \cap B_2^p$, $\bar{A}_c = E_c \cap B_2^p$, and $A_c = \operatorname{cone}(E_c) \cap B_2^p$,
assuming $\|\theta^*\|_2 = 1$, $\beta = 2$, we show that

$$w(\bar{A}_c) \le w(A_r) \le 3\, w(A_c) , \tag{5}$$

where $w(A) = E_g[\sup_{a \in A} \langle a, g \rangle]$, with $g \sim N(0, I_{p \times p})$ being an isotropic Gaussian vector, denotes the
Gaussian width* of the set $A$ [5, 14, 17]. Note that $A_r$ corresponds to the spherical cap of the error set
$E_r$, and $A_c$ corresponds to the spherical cap of the error cone $\operatorname{cone}(E_c)$ at the unit ball. Interestingly, the
above relationship between the widths of these spherical caps is geometric, and applies for any norm $R(\cdot)$.
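
As a small illustration of the Gaussian width $w(A) = E_g[\sup_{a \in A} \langle a, g \rangle]$, the sketch below estimates it by Monte Carlo for the unit $L_1$ ball, where the supremum is attained at a signed coordinate vector and therefore equals $\max_j |g_j|$. The function name, the number of trials, and the seed are assumptions made for illustration. The comparison value $\sqrt{2 \log(2p)}$ is the standard upper bound on the expected maximum of $2p$ standard Gaussians.

```python
import math
import random


def gaussian_width_l1_ball(p, n_trials=2000, seed=0):
    """Monte Carlo estimate of the Gaussian width of the unit L1 ball in R^p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # sup over {a : ||a||_1 <= 1} of <a, g> equals the L-infinity norm of g
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(p))
    return total / n_trials


def width_upper_bound(p):
    """Standard bound on E[max_j |g_j|] for p independent standard Gaussians."""
    return math.sqrt(2.0 * math.log(2 * p))
```

For example, `gaussian_width_l1_ball(100)` can be compared with `width_upper_bound(100)`; the estimate grows only logarithmically in `p`.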

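We establish a more general version of the above relationship. Let $\rho B_2^p$ denote an $L_2$ ball of any radius $\rho$ in
$\mathbb{R}^p$. Then, with $A_r^{\rho} = E_r \cap \rho B_2^p$, $\bar{A}_c^{\rho} = E_c \cap \rho B_2^p$, and $A_c^{\rho} = \operatorname{cone}(E_c) \cap \rho B_2^p$, we show that

$$w(\bar{A}_c^{\rho}) \le w(A_r^{\rho}) \le \left( 1 + \frac{2}{\beta - 1} \frac{\|\theta^*\|_2}{\rho} \right) w(A_c^{\rho}) . \tag{6}$$

As before, except for the scaling constants, the relationship between the restricted error sets is geometric,
and does not change based on the choice of the norm $R(\cdot)$.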

For the special case of the $L_1$ norm, [6] considered a simultaneous analysis of the Lasso and the Dantzig selector,
and characterized the structure of the error sets for the regularized and constrained problems. Further, while
the characterization in [6] was also geometric, it was not based on Gaussian widths. In
contrast, our results apply to any norm, not just $L_1$, and the geometric characterization is based on Gaussian
widths. The utility of the Gaussian width based characterization becomes evident later when we establish
sample complexity results for Gaussian and sub-Gaussian random matrices in terms of Gaussian widths of
spherical caps.

Bounds on estimation error: We establish bounds on the estimation error under two assumptions,
which are subsequently shown to hold with high probability for sub-Gaussian designs and noise models.
The first assumption is that the regularization parameter $\lambda_n$ is suitably large. In particular, for any $\beta > 1$,
the regularization parameter needs to satisfy

$$\lambda_n \ge \beta R^*(\nabla \mathcal{L}(\theta^*; Z^n)) , \tag{7}$$

where $R^*(\cdot)$ denotes the dual norm of $R(\cdot)$. The second assumption is that the loss, through the design matrix $X \in \mathbb{R}^{n \times p}$,
satisfies the restricted strong convexity (RSC) condition [29] in the error set $E_r$; in particular, there exists
a suitable constant $\kappa > 0$ so that

$$\delta\mathcal{L}(\Delta, \theta^*) \triangleq \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle \ge \kappa \|\Delta\|_2^2 \quad \forall \Delta \in E_r . \tag{8}$$

With such suitably large $\lambda_n$ and $\mathcal{L}$ satisfying the RSC condition, we establish the following bound:

$$\|\hat{\Delta}_n\|_2 \le c\, \psi(E_r) \frac{\lambda_n}{\kappa} , \tag{9}$$

where $\psi(E_r) = \sup_{u \in E_r} \frac{R(u)}{\|u\|_2}$ is a norm compatibility constant [29], and $c > 0$ is a constant. Note that the
above bound is deterministic, but relies on assumptions on $\lambda_n$ and $\kappa$. So, we focus on characterizations of
$\lambda_n$ and $\kappa$ which hold with high probability for sub-Gaussian design matrices $X$ and sub-Gaussian noise $\omega$.
Recent work has extended the analysis to sub-exponential distributions.


* A gentle exposition to Gaussian width and some of its properties is given in Appendix A.

Bounds on the regularization parameter $\lambda_n$: From (7) above, for the analysis to work, one needs to have
$\lambda_n \ge \beta R^*(\nabla \mathcal{L}(\theta^*; Z^n))$. There are a few challenges in getting a suitable bound for $\lambda_n$. First, the bound
depends on $\theta^*$, but $\theta^*$ is unknown and is the quantity one is interested in estimating. Second, the bound depends
on $Z^n$, the samples, and is hence random. The goal will be to bound the expectation $E[R^*(\nabla \mathcal{L}(\theta^*; Z^n))]$
over all samples of size $n$, and obtain high-probability deviation bounds. Third, since the bound relies on the
(dual) norm of a $p$-dimensional random vector, without proper care, the lower bound on $\lambda_n$ may end
up having a large scaling dependency on the ambient dimensionality. Since the error bound in (9)
is directly proportional to $\lambda_n$, such dependencies will lead to weak bounds.

In Section 3, we characterize the expectation $E[R^*(\nabla \mathcal{L}(\theta^*; Z^n))]$ in terms of the geometry of the unit
norm-ball of $R$, which leads to a sharp bound. Let $\Omega_R = \{u \in \mathbb{R}^p \mid R(u) \le 1\}$ denote the unit norm-ball. Then,
for sub-Gaussian design matrices and squared loss, we show that

$$E[R^*(\nabla \mathcal{L}(\theta^*; Z^n))] \le \frac{c}{\sqrt{n}}\, w(\Omega_R) , \tag{10}$$

which scales as the Gaussian width of $\Omega_R$. Interestingly, for sub-Gaussian designs, one obtains the results
in terms of the 'sub-Gaussian width' of the unit norm-ball, which can be upper bounded by a constant times
the Gaussian width using generic chaining. The result can be extended to the case of anisotropic
sub-Gaussian designs, where the constant $c$ starts depending on the maximum eigenvalue (operator norm) of
the corresponding covariance matrix. Further, one can get high-probability versions of these bounds using
related advances in generic chaining. The results can also be extended to general convex losses,
such as those from generalized linear models.

The above characterization allows one to choose $\lambda_n \ge \frac{c}{\sqrt{n}}\, w(\Omega_R)$. For the special case of $L_1$ regularization,
$\Omega_R$ is the unit $L_1$ norm ball, and the corresponding Gaussian width satisfies $w(\Omega_R) \le c_1 \sqrt{\log p}$, which explains the
$\sqrt{\log p}$ term one finds in existing bounds for Lasso [29, 6]. When working with other norms, one simply
needs to get an upper bound on the corresponding $w(\Omega_R)$.
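
The scale of the regularization parameter can be checked on simulated data. The sketch below assumes squared loss with the $L_1$ norm, so that the dual norm is the $L_\infty$ norm and the gradient at $\theta^*$ is, up to sign, $X^T \omega / n$. It draws an isotropic Gaussian design and Gaussian noise and compares the realized dual norm with the reference scale $\sqrt{2 \log(2p) / n}$. The function names, problem sizes, and seed are illustrative assumptions.

```python
import math
import random


def dual_norm_of_gradient(X, omega):
    """L-infinity norm of X^T omega / n, i.e. R*(grad L(theta*)) when R is the L1 norm."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * omega[i] for i in range(n))) / n for j in range(p))


def lambda_scale(n=200, p=50, seed=1):
    """Return the realized dual norm and the reference scale sqrt(2 log(2p) / n)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    omega = [rng.gauss(0, 1) for _ in range(n)]
    observed = dual_norm_of_gradient(X, omega)
    reference = math.sqrt(2.0 * math.log(2 * p) / n)
    return observed, reference
```

Neither quantity depends on $\theta^*$, which is why the squared-loss case removes the first difficulty listed above.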

Restricted eigenvalue conditions: When the loss function under consideration is the squared loss, the RSC
condition in (8) reduces to the restricted eigenvalue (RE) condition on the design matrix. Our analysis focuses
on establishing the RE condition on $A_r = \operatorname{cone}(E_r) \cap S^{p-1}$, the spherical cap obtained by intersecting the
cone of the error set with the unit hypersphere, since it implies the RE condition on $E_r$. For isotropic
sub-Gaussian design matrices, a stronger two-sided restricted isometry property (RIP) holds, i.e., with high
probability, for any $A \subseteq S^{p-1}$, we have

$$1 - c \frac{w(A)}{\sqrt{n}} \le \inf_{u \in A} \frac{1}{n} \|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n} \|Xu\|_2^2 \le 1 + c \frac{w(A)}{\sqrt{n}} , \tag{11}$$

where $w(A)$ is the Gaussian width of $A$. Thus, for say $n_0 = 4c^2 w^2(A_r)$, and for $n > n_0$, an RE condition
of the form

$$\inf_{u \in A_r} \frac{1}{n} \|Xu\|_2^2 \ge 1/2 \tag{12}$$

is satisfied with high probability. Instead of the constant $1/2$, one can have any constant less than 1, with a
suitable increase in $n_0$. Thus, one does not need to treat the RE condition as an assumption for isotropic
sub-Gaussian designs; it always holds with high probability, with the phase transition happening at $O(w^2(A_r))$
samples. The RIP results can be generalized to anisotropic sub-Gaussian designs, where additional constants
depending on the restricted eigenvalues of the anisotropic covariance matrix $\Sigma$ show up, but the form of the
bound stays similar. Our analysis techniques for the RIP results are based on generic chaining.
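
The concentration of $\frac{1}{n}\|Xu\|_2^2$ around 1 can be observed empirically. The sketch below draws an isotropic Gaussian design and evaluates the quadratic form on randomly sampled $s$-sparse unit vectors. Sampling only explores finitely many points, so the reported minimum and maximum are optimistic proxies for the infimum and supremum over a spherical cap, not the quantities themselves. All names and sizes are illustrative assumptions.

```python
import random


def sample_sparse_unit_vector(p, s, rng):
    """Random s-sparse vector on the unit sphere, stored as {index: value}."""
    support = rng.sample(range(p), s)
    vals = [rng.gauss(0, 1) for _ in support]
    norm = sum(v * v for v in vals) ** 0.5
    return {j: v / norm for j, v in zip(support, vals)}


def quadratic_form_range(n=400, p=50, s=3, n_vectors=200, seed=0):
    """Min and max of (1/n)||X u||_2^2 over sampled s-sparse unit vectors u."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    values = []
    for _ in range(n_vectors):
        u = sample_sparse_unit_vector(p, s, rng)
        # (1/n) * sum_i <X_i, u>^2, using only the nonzero entries of u
        q = sum(sum(row[j] * c for j, c in u.items()) ** 2 for row in X) / n
        values.append(q)
    return min(values), max(values)
```

Increasing `n` while keeping `p` and `s` fixed tightens the observed range around 1, in line with the $w(A)/\sqrt{n}$ deviation term.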

Generalized linear models and restricted strong convexity: For convex loss functions, such as those
coming from generalized linear models (GLMs), the sample complexity and associated phase transition behavior
is determined by the Restricted Strong Convexity (RSC) condition [29]. By generalizing our argument for
RE conditions corresponding to squared loss, we show that the RSC conditions are satisfied for
convex losses for sub-Gaussian designs at the same order of sample complexity as that for squared loss. In
particular, for any $A \subseteq S^{p-1}$, we show a high probability lower bound of the form

$$\inf_{u \in A} \delta\mathcal{L}(u, \theta^*) \ge c_1 - c_2 \frac{w(A)}{\sqrt{n}} , \tag{13}$$

where the constants $c_1, c_2 > 0$ depend on the tail probabilities of the design matrix distribution. Specializing
the result to $A_r = \operatorname{cone}(E_r) \cap S^{p-1}$, we note that the sample complexity still scales as $O(w^2(A_r))$, similar
to the case of squared loss. The result is thus a considerable generalization of earlier results on convex losses,
such as GLMs, which had looked at specific norms and associated cones and/or did not express the results
in terms of the Gaussian width of $A$ [29].

Putting everything together: With the above results in place, from (9), the main bound takes the form

$$\|\hat{\Delta}_n\|_2 \le c\, \psi(E_r)\, \frac{w(\Omega_R)/\sqrt{n}}{c_1 - c_2\, w(A_r)/\sqrt{n}} \tag{14}$$

with high probability, where $w(\Omega_R)$ is the Gaussian width of the unit norm ball, $w(A_r)$ is the Gaussian width
of the spherical cap corresponding to the error set $\operatorname{cone}(E_r)$, and the result is valid only when $n > n_0 =
O(w^2(A_r))$, which corresponds to the sample complexity. For the special case of the $L_1$ norm, i.e., Lasso, the
sample complexity $n_0$ is of the order $w^2(A_r) = O(s \log p)$. Further, up to constants, $w(\Omega_R) \le \sqrt{\log p}$ and $\psi(E_r) \le \sqrt{s}$.
Plugging in these values and choosing $\beta = 2$, for $n > c' s \log p$, the bound $\|\hat{\Delta}_n\|_2 \le c'' \sqrt{\frac{s \log p}{n}}$ holds with high
probability, where $c', c'' > 0$ are constants. For other norms, one can simply plug in the widths to get the corresponding sample complexity
and non-asymptotic error bounds.
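
The Lasso plug-in above amounts to simple arithmetic, sketched below. The constants `c_sample` and `c_error` are placeholders set to 1 purely for illustration; the analysis only guarantees that suitable constants exist, so the returned numbers indicate scaling and are not calibrated thresholds.

```python
import math


def lasso_plug_in(s, p, n, c_sample=1.0, c_error=1.0):
    """Return (n0, bound) with n0 ~ s log p and bound ~ sqrt(s log p / n).

    The bound is None when n does not exceed the sample complexity n0,
    since the error bound is only claimed for n > n0.
    """
    n0 = c_sample * s * math.log(p)
    if n <= n0:
        return n0, None
    return n0, c_error * math.sqrt(s * math.log(p) / n)
```

For instance, quadrupling `n` beyond `n0` halves the returned error scale, which reflects the $1/\sqrt{n}$ decay.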

The rest of the paper is organized as follows: Section 2 presents results on the restricted error set and
deterministic error bounds under suitable bounds on the regularization parameter $\lambda_n$ and RSC assumptions.
Section 3 presents a characterization of $\lambda_n$ in terms of the Gaussian width of the unit norm ball for Gaussian
as well as sub-Gaussian designs and noise. Section 4 proves RE conditions and associated sample complexity
results corresponding to squared loss functions. Results are presented for sub-Gaussian designs, including
anisotropic and correlated cases, and always in terms of the Gaussian width of the spherical cap corresponding
to the error set. Section 5 presents RSC conditions corresponding to general convex losses arising from
generalized linear models, and the results are again in terms of the Gaussian width of the spherical cap
corresponding to the error set. We conclude in Section 6. All technical arguments and proofs are in the appendix,
along with a gentle exposition to Gaussian widths and related results.

A brief word on the notation used. We denote random matrices as $X$, and random vectors as $X_i$, where $i$
may be an index to a row or column of a random matrix. Vector norms are denoted as $\|\cdot\|$, e.g., $\|X_i\|_2$ for a
(random) vector $X_i$, and norms of random variables are denoted as $|||\cdot|||$, e.g., $|||X|||_2 = E[\|X\|_2]$.


2 Restricted Error Set and Recovery Guarantees 

In this section, we give a characterization of the restricted error set $E_r$ in which the error vector $\hat{\Delta}_n = (\hat{\theta}_{\lambda_n} - \theta^*)$
lies, establish clear relationships between the error sets for the regularized and constrained problems, and
finally establish upper bounds on the estimation error. The error bound is deterministic, but has quantities
which involve $\theta^*$, $X$, $\omega$, for which we develop high probability bounds in Sections 3, 4, and 5.
Figure 1: Relationship between the error set of the regularized problem ($A_r$, green region) and the constrained
problem ($A_c$, gray region) after intersection with a ball of radius $\rho$. While $A_r$ will be larger in general, it will
be within a constant factor of $A_c$ in terms of Gaussian width (best viewed in color).


2.1 The Restricted Error Set and the Error Cone 

We start with a characterization of the restricted error set $E_r$ where $\hat{\Delta}_n$ will belong.

Lemma 1 For any $\beta > 1$, assume that

$$\lambda_n \ge \beta R^*(\nabla \mathcal{L}(\theta^*; Z^n)) , \tag{15}$$

where $R^*(\cdot)$ is the dual norm of $R(\cdot)$. Then the error vector $\hat{\Delta}_n = \hat{\theta}_{\lambda_n} - \theta^*$ belongs to the set

$$E_r = E_r(\theta^*, \beta) = \left\{ \Delta \in \mathbb{R}^p \;\middle|\; R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta) \right\} . \tag{16}$$

The restricted error set $E_r$ need not be convex for general norms. Interestingly, for $\beta = 1$, the inequality
in (16) is just the triangle inequality, and is satisfied by all $\Delta$. Note that $\beta > 1$ restricts the set of $\Delta$ which
satisfy the inequality, yielding the restricted error set. In particular, $\Delta$ cannot go in the direction of $\theta^*$, i.e.,
$\Delta \ne \alpha \theta^*$ for any $\alpha > 0$. Further, note that the condition in (15) is similar to that in [29] for $\beta = 2$, but the
above characterization holds for any norm, not just decomposable norms.
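
The membership condition $R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta)$ is easy to test numerically. The sketch below does so for the $L_1$ norm, with any other norm usable by passing a different `norm` callable. The function names and the small tolerance are assumptions added for illustration.

```python
def l1(v):
    """L1 norm of a vector given as a list of floats."""
    return sum(abs(x) for x in v)


def in_restricted_error_set(theta_star, delta, beta, norm=l1, tol=1e-12):
    """Check R(theta* + delta) <= R(theta*) + R(delta) / beta for a norm R."""
    shifted = [t + d for t, d in zip(theta_star, delta)]
    return norm(shifted) <= norm(theta_star) + norm(delta) / beta + tol


theta_star = [1.0, 0.0, 0.0]
along_theta = [0.5, 0.0, 0.0]    # delta = alpha * theta* with alpha = 0.5
against_theta = [-0.5, 0.0, 0.0]
```

With `beta=1` every `delta` passes, because the condition is the triangle inequality. With `beta=2`, `along_theta` is excluded while `against_theta` is admitted, matching the remark that the error cannot point in the direction of $\theta^*$.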


While $E_r$ need not be a convex set, we establish a relationship between $E_r$ and the error set $E_c$ corresponding
to constrained estimators [13, 6, 14, 15]. A recent example of such a constrained estimator is the generalized
Dantzig selector (GDS) [15], given by:

$$\hat{\theta}_{\gamma_n} = \underset{\theta \in \mathbb{R}^p}{\operatorname{argmin}} \; R(\theta) \quad \text{s.t.} \quad R^*(X^T(y - X\theta)) \le \gamma_n , \tag{17}$$

where $R^*(\cdot)$ denotes the dual norm of $R(\cdot)$. One can show [14, 15] that the restricted error set for such
constrained estimators [14, 15, 40] is of the form:

$$E_c = \{ \Delta \in \mathbb{R}^p \mid R(\theta^* + \Delta) \le R(\theta^*) \} . \tag{18}$$
Figure 2: Schematic for norm regularized objective functions considered. The finite sample estimate $\hat{\theta}_{\lambda_n}$ has
lower empirical loss than the optimum $\theta^*$. Bounding the difference between the losses yields a bound on
$\|\hat{\theta}_{\lambda_n} - \theta^*\|$.


By definition, it is easy to see that $E_c$ is always convex, and that $E_c \subseteq E_r$, as shown schematically in
Figure 1.

The following result establishes a relationship between $E_r$ and $E_c$ in terms of their Gaussian widths.


Theorem 1 Let $\bar{A}_c^{\rho} = E_c \cap \rho B_2^p$, $A_r^{\rho} = E_r \cap \rho B_2^p$, and $A_c^{\rho} = C_c \cap \rho B_2^p$, where $C_c = \operatorname{cone}(E_c)$ and
$\rho B_2^p = \{u \mid \|u\|_2 \le \rho\}$ is the $L_2$ ball of any radius $\rho > 0$. Then, for any $\beta > 1$ we have

$$w(\bar{A}_c^{\rho}) \le w(A_r^{\rho}) \le \left( 1 + \frac{2}{\beta - 1} \frac{\|\theta^*\|_2}{\rho} \right) w(A_c^{\rho}) , \tag{19}$$

where $w(A)$ denotes the Gaussian width of any set $A$, given by $w(A) = E_g\left[\sup_{a \in A} \langle a, g \rangle\right]$, where $g$ is an
isotropic Gaussian random vector, i.e., $g \sim N(0, I_{p \times p})$.


Thus, the Gaussian widths of the error sets of the regularized and constrained problems are closely related. See
Figure 1 for more details. In particular, for $\|\theta^*\|_2 = 1$, with $\rho = 1$, $\beta = 2$, we have $w(\bar{A}_c) \le w(A_r) \le
3 w(A_c)$ as introduced in Section 1. Related observations have been made for the special case of the $L_1$
norm [6], although past work did not provide an explicit characterization in terms of Gaussian widths. The
result also suggests that it is possible to move between the error analysis of the regularized and the
constrained versions of the estimation problem.
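
The constant relating the two widths can be tabulated directly. The sketch below assumes the factor in Theorem 1 has the form $1 + \frac{2}{\beta - 1} \frac{\|\theta^*\|_2}{\rho}$, which is consistent with the value 3 quoted for $\|\theta^*\|_2 = 1$, $\rho = 1$, $\beta = 2$. The function name and argument names are illustrative.

```python
def width_ratio_bound(beta, theta_norm, rho):
    """Factor multiplying w(A_c) in the upper bound on w(A_r), under the assumed form."""
    if beta <= 1.0:
        raise ValueError("beta must be greater than 1")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return 1.0 + (2.0 / (beta - 1.0)) * (theta_norm / rho)
```

The factor decreases towards 1 as `beta` grows or as the radius `rho` grows relative to the norm of $\theta^*$, which agrees with the observation that $E_r$ approaches $E_c$ for large $\beta$.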


2.2 Recovery Guarantees 

In order to establish recovery guarantees, we start by assuming that restricted strong convexity (RSC) is
satisfied by the loss function in $E_r$, the error set, so that for any $\Delta \in E_r$, there exists a suitable constant $\kappa$ so
that

$$\delta\mathcal{L}(\Delta, \theta^*) \triangleq \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle \ge \kappa \|\Delta\|_2^2 . \tag{20}$$

In Sections 4 and 5, we establish precise forms of the RSC condition for a wide variety of design matrices
and loss functions. In order to establish recovery guarantees, we focus on the quantity

$$\mathcal{F}(\Delta) = \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n (R(\theta^* + \Delta) - R(\theta^*)) . \tag{21}$$
Figure 3: Schematic illustrating the error bound in Lemma 2. Under restricted strong convexity (RSC) of
the loss function in the error set $E_r$, the error $\|\hat{\Delta}_n\|_2$ can be bounded in terms of the gradient of the overall
objective evaluated at $\theta^*$.


Since $\hat{\theta}_{\lambda_n} = \theta^* + \hat{\Delta}_n$ is the estimated parameter, i.e., $\hat{\theta}_{\lambda_n}$ is the minimum of the objective, we clearly have
$\mathcal{F}(\hat{\Delta}_n) \le 0$, which implies a bound on $\|\hat{\Delta}_n\|_2$. Unlike previous analysis, the bound can be established
without making any additional assumptions on the norm $R(\theta)$. We start with the following result, which
expresses the upper bound on $\|\hat{\Delta}_n\|_2$ in terms of the gradient of the objective at $\theta^*$.

Lemma 2 Assume that the RSC condition is satisfied in $E_r$ by the loss $\mathcal{L}(\cdot)$ with parameter $\kappa$. With $\hat{\Delta}_n =
\hat{\theta}_{\lambda_n} - \theta^*$, for any norm $R(\cdot)$, we have

$$\|\hat{\Delta}_n\|_2 \le \frac{1}{\kappa} \|\nabla \mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2 , \tag{22}$$

where $\nabla R(\cdot)$ is any sub-gradient of the norm $R(\cdot)$.

Figure 3 illustrates the above result. Note that the right hand side is simply the $L_2$ norm of the gradient
of the objective evaluated at $\theta^*$. For the special case when $\hat{\theta}_{\lambda_n} = \theta^*$, the gradient of the objective is zero,
implying correctly that $\|\hat{\Delta}_n\|_2 = 0$. While the above result provides useful insights about the bound on
$\|\hat{\Delta}_n\|_2$, the quantities on the right hand side depend on $\theta^*$, which is unknown. We present another form of
the result in terms of quantities such as $\lambda_n$, $\kappa$, and the norm compatibility constant $\psi(E_r) = \sup_{u \in E_r} \frac{R(u)}{\|u\|_2}$,
which are often easier to compute or bound.


Theorem 2 Assume that the RSC condition is satisfied in $E_r$ by the loss $\mathcal{L}(\cdot)$ with parameter $\kappa$. With
$\hat{\Delta}_n = \hat{\theta}_{\lambda_n} - \theta^*$, for any norm $R(\cdot)$, we have

$$\|\hat{\Delta}_n\|_2 \le \psi(E_r)\, \frac{1 + \beta}{\beta}\, \frac{\lambda_n}{\kappa} . \tag{23}$$


The above result is deterministic, but contains $\lambda_n$ and $\kappa$. In Section 3, we give precise characterizations
of $\lambda_n$, which needs to satisfy (15). In Sections 4 and 5, we characterize the RSC condition constant $\kappa$ for
different losses and a variety of design matrices.


2.3 A Special Case: Decomposable Norms 

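In recent work, [29] considered regularized regression with the special case of decomposable norms, defined
in terms of a pair of subspaces $\mathcal{M} \subseteq \bar{\mathcal{M}}$ of $\mathbb{R}^p$. The model is assumed to be in the subspace $\mathcal{M}$, and the
definition considers the so-called perturbation subspace $\bar{\mathcal{M}}^{\perp}$, which is the orthogonal complement of $\bar{\mathcal{M}}$. A
norm $R(\cdot)$ is considered decomposable with respect to subspaces $(\mathcal{M}, \bar{\mathcal{M}}^{\perp})$ if $R(\theta + \gamma) = R(\theta) + R(\gamma)$ for
all $\theta \in \mathcal{M}$ and $\gamma \in \bar{\mathcal{M}}^{\perp}$.

We show that for decomposable norms, the error set $E_r$ in our analysis is included in the error cone defined
in [29]. In the current context, let $\beta = 2$, $\theta^* \in \mathcal{M}$; then for any $\Delta = \Delta_{\bar{\mathcal{M}}^{\perp}} + \Delta_{\bar{\mathcal{M}}} \in E_r$, we have

$$R(\theta^* + \Delta) \le R(\theta^*) + \tfrac{1}{2} R(\Delta) \tag{24}$$

$$\Rightarrow \; R(\theta^* + \Delta_{\bar{\mathcal{M}}^{\perp}} + \Delta_{\bar{\mathcal{M}}}) \le R(\theta^*) + \tfrac{1}{2} R(\Delta_{\bar{\mathcal{M}}^{\perp}} + \Delta_{\bar{\mathcal{M}}}) \tag{25}$$

$$\overset{(a)}{\Rightarrow} \; R(\theta^* + \Delta_{\bar{\mathcal{M}}^{\perp}}) - R(\Delta_{\bar{\mathcal{M}}}) \le R(\theta^*) + \tfrac{1}{2} R(\Delta_{\bar{\mathcal{M}}^{\perp}}) + \tfrac{1}{2} R(\Delta_{\bar{\mathcal{M}}}) \tag{26}$$

$$\overset{(b)}{\Rightarrow} \; R(\theta^*) + R(\Delta_{\bar{\mathcal{M}}^{\perp}}) - R(\Delta_{\bar{\mathcal{M}}}) \le R(\theta^*) + \tfrac{1}{2} R(\Delta_{\bar{\mathcal{M}}^{\perp}}) + \tfrac{1}{2} R(\Delta_{\bar{\mathcal{M}}}) \tag{27}$$

$$\Rightarrow \; R(\Delta_{\bar{\mathcal{M}}^{\perp}}) \le 3 R(\Delta_{\bar{\mathcal{M}}}) , \tag{28}$$

where inequality (a) follows from the triangle inequality and (b) follows from decomposability of the norm.
The last inequality is precisely the error cone in [29] for $\theta^* \in \mathcal{M}$. As a result, for any $\Delta \in E_r$, for
decomposable norms we have

$$R(\Delta) = R(\Delta_{\bar{\mathcal{M}}^{\perp}} + \Delta_{\bar{\mathcal{M}}}) \le R(\Delta_{\bar{\mathcal{M}}^{\perp}}) + R(\Delta_{\bar{\mathcal{M}}}) \le 4 R(\Delta_{\bar{\mathcal{M}}}) . \tag{29}$$

Hence, the norm compatibility constant can be bounded as

$$\psi(E_r) = \sup_{\Delta \in E_r} \frac{R(\Delta)}{\|\Delta\|_2} \le \sup_{\Delta \in E_r} \frac{4 R(\Delta_{\bar{\mathcal{M}}})}{\|\Delta\|_2} \le \sup_{u \in \bar{\mathcal{M}} \setminus \{0\}} \frac{4 R(u)}{\|u\|_2} = 4 \Psi(\bar{\mathcal{M}}) , \tag{30}$$

where $\Psi(\bar{\mathcal{M}})$ is the subspace compatibility in $\bar{\mathcal{M}}$, as used in [29].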
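
The inclusion of $E_r$ in the error cone can be checked numerically for the $L_1$ norm, which is decomposable with respect to the coordinate subspace spanned by the support of $\theta^*$. The sketch below samples random perturbations, keeps those that lie in $E_r$ for $\beta = 2$, and counts how many violate the cone condition $R(\Delta_{\bar{\mathcal{M}}^{\perp}}) \le 3 R(\Delta_{\bar{\mathcal{M}}})$. Taking $\mathcal{M} = \bar{\mathcal{M}}$ to be the support subspace, along with the sampling scheme and all names, is an assumption made for illustration.

```python
import random


def l1(v):
    """L1 norm of a vector given as a list of floats."""
    return sum(abs(x) for x in v)


def check_cone(theta_star, n_samples=5000, seed=0, beta=2.0):
    """Count sampled members of E_r and how many of them violate the 3x cone condition."""
    rng = random.Random(seed)
    support = [j for j, t in enumerate(theta_star) if t != 0.0]
    members, violations = 0, 0
    for _ in range(n_samples):
        delta = [rng.gauss(0, 1) for _ in theta_star]
        shifted = [t + d for t, d in zip(theta_star, delta)]
        # Membership in the restricted error set E_r for the L1 norm
        if l1(shifted) <= l1(theta_star) + l1(delta) / beta:
            members += 1
            on_support = sum(abs(delta[j]) for j in support)
            off_support = l1(delta) - on_support
            if off_support > 3.0 * on_support + 1e-9:
                violations += 1
    return members, violations
```

By the chain (24) to (28), no sampled member of $E_r$ should violate the cone condition when `beta` is 2, so the second returned count is expected to be zero.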


3 Bounds on the Regularization Parameter 

Recall that the parameter $\lambda_n$ needs to satisfy the inequality

$$\lambda_n \ge \beta R^*(\nabla \mathcal{L}(\theta^*; Z^n)) . \tag{31}$$

The right hand side of the inequality has two issues: the expression depends on $\theta^*$, the optimal parameter
which is unknown, and the expression is a random variable, since it depends on $Z^n$. In this section, we
characterize the expectation $E[R^*(\nabla \mathcal{L}(\theta^*; Z^n))]$ in terms of the Gaussian width of the unit norm ball $\Omega_R =
\{u : R(u) \le 1\}$, and further discuss its upper bounds. For ease of exposition, we present results for the case
of squared loss, i.e., $\mathcal{L}(\theta^*; Z^n) = \frac{1}{2n}\|y - X\theta^*\|_2^2$ with the linear model $y = X\theta^* + \omega$, where $\omega$ is a noise vector
with i.i.d. entries. Under this setting,

$$\nabla \mathcal{L}(\theta^*; Z^n) = -\frac{1}{n} X^T (y - X\theta^*) = -\frac{1}{n} X^T \omega , \tag{32}$$

which eliminates the dependency on the unknown $\theta^*$. Before presenting the results, we introduce some
notation. We let $\Lambda_{\max}(\cdot)$ denote the largest eigenvalue of a square matrix. We also recall the definition of
the sub-Gaussian norm for a sub-Gaussian variable $x$, $|||x|||_{\psi_2} = \sup_{p \ge 1} \frac{1}{\sqrt{p}} (E[|x|^p])^{1/p}$.

From this section onwards, the analysis will take into account the randomness of the design $X$ and the noise
$\omega$. Here we give a brief description of our assumptions on $X$ and $\omega$.
Isotropic Sub-Gaussian Designs: the design matrix $X \in \mathbb{R}^{n \times p}$ has independent sub-Gaussian rows, where
each row satisfies $|||X_i|||_{\psi_2} \le K$ and $E[X_i^T X_i] = I_{p \times p}$. Thus, the measure $\mu$ from which the rows $X_i$ are
sampled independently is an isotropic sub-Gaussian measure.

Anisotropic Sub-Gaussian Designs: the design matrix $X \in \mathbb{R}^{n \times p}$ has independent rows, and each row $X_i$
is anisotropic sub-Gaussian with $E[X_i^T X_i] = \Sigma$. Further, we assume that the corresponding isotropic random
vector $\tilde{X}_i = X_i \Sigma^{-1/2}$ satisfies $|||\tilde{X}_i|||_{\psi_2} \le K$. A simple special case of such an anisotropic sub-Gaussian
design is when $X_i \sim N(0, \Sigma)$, where $\tilde{X}_i = X_i \Sigma^{-1/2} \sim N(0, I)$ so that $|||\tilde{X}_i|||_{\psi_2} = 1$.

Sub-Gaussian Noise: the noise $\omega$ has i.i.d. centered unit-variance sub-Gaussian entries with bounded
sub-Gaussian norm $|||\omega_i|||_{\psi_2}$.

For convenience, we only use the shorthand in bold font to specify the assumptions. In the following theorem,
we characterize the expectation of $R^*(\nabla \mathcal{L}(\theta^*; Z^n))$ in terms of the Gaussian width of the unit norm ball, $w(\Omega_R)$.


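Theorem 3 Let $\Omega_R = \{u : R(u) \le 1\}$, and let $\mathcal{L}$ be the squared loss. For sub-Gaussian design $X$ and noise
$\omega$, we have

$$E[R^*(\nabla \mathcal{L}(\theta^*; Z^n))] \le \frac{\eta_0}{\sqrt{n}}\, w(\Omega_R) , \tag{33}$$

where the expectation is taken over both $X$ and $\omega$. The constant $\eta_0$ takes one value if $X$ is isotropic and
another, which involves $\sqrt{\Lambda_{\max}(\Sigma)}$, if $X$ is anisotropic.


Bounding the expectation of $R^*(\nabla \mathcal{L}(\theta^*; Z^n))$ gives us a rough scale of the regularization parameter $\lambda_n$. In
the next theorem, we present a high-probability upper bound for $R^*(\nabla \mathcal{L}(\theta^*; Z^n))$.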


Theorem 4 Let design X and noise oj be sub-Gaussian, and C be squared loss. Define f ll^lb. 

then for any r > 0, with probability at least 1 — ci exp min con^^ we have 

-y 1 

R* {XC{9*; Z^)) < y J {c^K • w{nR) + r) , (34) 

where $c$, $c_0$, $c_1$ and $c_2$ are all absolute constants, and the remaining constant is the same as in Theorem 3.

Bounding the Gaussian width $w(\Omega_R)$: In certain cases, one may be able to directly obtain a bound on the
Gaussian width $w(\Omega_R)$. Here, we provide a mechanism for bounding the Gaussian width $w(\Omega_R)$ of the unit
norm ball in terms of the Gaussian width of a suitable cone, obtained by shifting or translating the norm ball.
In particular, the result involves taking any point on the boundary of the unit norm ball, considering that as
the origin, and constructing a cone using the norm ball. Since such a construction can be done with any point
on the boundary, the tightest bound is obtained by taking the infimum over all points on the boundary. The
motivation for upper bounding the Gaussian width $w(\Omega_R)$ of the unit norm ball in terms of the
Gaussian width of such a cone is that considerable advances have been made in recent years in upper
bounding Gaussian widths of such cones [14, 1].


Lemma 3 Let $\Omega_R = \{u : R(u) \le 1\}$ be the unit norm ball and $\Theta_R = \{u : R(u) = 1\}$ be the boundary.
For any $\tilde{\theta} \in \Theta_R$, $\rho(\tilde{\theta}) = \sup_{\theta : R(\theta) \le 1} \|\theta - \tilde{\theta}\|_2$ is the diameter of $\Omega_R$ measured with respect to $\tilde{\theta}$. Let
$G(\tilde{\theta}) = \operatorname{cone}(\Omega_R - \tilde{\theta}) \cap \rho(\tilde{\theta}) B_2^p$, i.e., the cone of $(\Omega_R - \tilde{\theta})$ intersecting the ball of radius $\rho(\tilde{\theta})$. Then

$$w(\Omega_R) \le \inf_{\tilde{\theta} \in \Theta_R} w(G(\tilde{\theta})) . \tag{35}$$
Figure 4: Bounding the Gaussian width of a norm ball, e.g., corresponding to the $L_1$ norm, by shifting the norm
ball and using the width of the corresponding cone (Lemma 3). The approach allows one to directly use
existing results on bounding Gaussian widths of certain cones. In some cases, it may be easier to directly
bound the Gaussian width of the norm ball, rather than using the shifting argument.


The analysis and results for $\lambda_n$ presented above can be extended to general convex losses arising in the
context of GLMs for sub-Gaussian designs and sub-Gaussian noise (see Section 5).


4 Least Squares Models: Restricted Eigenvalue Conditions 


The error bound analysis in Theorem 2 depends on the restricted strong convexity (RSC) assumption. In this
section, we establish RSC conditions for sub-Gaussian design matrices when the loss function is the squared
loss. For squared loss, i.e., $\mathcal{L}(\theta; Z^n) = \frac{1}{n}\|y - X\theta\|_2^2$, the RSC condition (20) becomes equivalent to the
Restricted Eigenvalue (RE) condition [6, 29], since

$$\delta\mathcal{L}(\Delta, \theta^*) = \frac{1}{n}\|y - X(\theta^* + \Delta)\|_2^2 - \frac{1}{n}\|y - X\theta^*\|_2^2 + \frac{2}{n}\langle X^T(y - X\theta^*), \Delta \rangle = \frac{1}{n}\|X\Delta\|_2^2 = \frac{1}{n}\sum_{i=1}^n \langle X_i, \Delta \rangle^2 , \tag{36}$$

so that the condition simplifies to

$$\delta\mathcal{L}(\Delta, \theta^*) = \frac{1}{n}\sum_{i=1}^n \langle X_i, \Delta \rangle^2 \ge \kappa \|\Delta\|_2^2 \tag{37}$$

for all $\Delta \in E_r$. We make two simplifications which let us develop the RE results in terms of widths of
spherical caps rather than over the error set $E_r$. Let $n_{E_r}$ be the sample complexity for the RE condition over
the set $E_r$, so that for $n > n_{E_r}$ samples, with high probability

$$\inf_{\Delta \in E_r} \frac{1}{n}\|X\Delta\|_2^2 \ge \kappa_{E_r} \|\Delta\|_2^2 , \tag{38}$$
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for some HEr > 0- Let Cr = cone{Er) and let nc^ be the sample complexity for the RE condition over the 
cone Cr, so that for n > ncr samples, with high probability 


inf -\\XA\\l > KCr\\X\\l , 


(39) 


for some kc^ > 0. Since Er C Cr, we have UEr < nCr- Thus, it is sufficient to obtain (an upper bound to) 
the sample complexity nCr, since that will serve as an upper bound to n^;^, the sample complexity over Er- 
Further, since Cr is a cone, the absolute magnitude || A ||2 does not affect the sample complexity. As a result, 
it is sufficient to focus on a spherical cap A = C,. H 5^“^. In particular, if ua denotes the sample complexity 
for the RE condition over the spherical cap A, so that for n > ua samples, with high probability 

inf -||Au ||2 > fc^||n||i , (40) 

u&A n 

for some ka > 0, then ua = nCr > nEr- Noting that ||u ||2 = 1 for u G Cr H SP~^, we consider sample 
complexity results for RE conditions the form 

1 1 

inf -||A:u ||2 = inf - > KA{n,p) (41) 

u&A n u&A n ^^ 

2=1 

where KA{n,p) > 0 with high probability for n > n^. In this section, we characterize sample complexity 
UA over any given spherical cap A, and establish RE conditions for isotropic and anisotropic sub-Gaussian 
design matrices X in terms of the Gaussian width w{A). 
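The identity in (36), which reduces the RSC quantity for squared loss to a quadratic form in the design matrix, can be checked numerically. The following is a minimal sketch, assuming the $\frac{1}{n}$-scaled squared loss stated above; the function names, the problem sizes, and the Gaussian data are illustrative assumptions and not part of the analysis.

```python
import numpy as np


def squared_loss(theta, X, y):
    """Squared loss (1/n) * ||y - X theta||_2^2."""
    n = X.shape[0]
    r = y - X @ theta
    return float(r @ r) / n


def squared_loss_grad(theta, X, y):
    """Gradient of the squared loss: -(2/n) * X^T (y - X theta)."""
    n = X.shape[0]
    return -2.0 / n * (X.T @ (y - X @ theta))


def delta_loss(delta, theta_star, X, y):
    """First-order Taylor remainder of the loss at theta_star in direction delta."""
    return (squared_loss(theta_star + delta, X, y)
            - squared_loss(theta_star, X, y)
            - float(squared_loss_grad(theta_star, X, y) @ delta))


# Illustrative data (assumed sizes and noise level).
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star + 0.1 * rng.standard_normal(n)
delta = rng.standard_normal(p)

lhs = delta_loss(delta, theta_star, X, y)        # delta L(Delta, theta*)
rhs = float(np.sum((X @ delta) ** 2)) / n        # (1/n) * ||X Delta||_2^2
print(lhs, rhs)
```

The two printed values should agree up to floating-point error for any choice of `delta`, which is why the RSC condition for squared loss depends only on $X$ and not on $y$ or $\theta^*$.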

Analyses of RE conditions for certain types of design matrices and certain norms, especially the $L_1$ norm, have appeared in the literature ||3[33l[33l. The RE/RIP conditions for independent isotropic Gaussian designs have been widely studied for the case of the $L_1$ norm ifTOl 1^. The generalization to the RE condition for correlated Gaussian designs, for the special case of the $L_1$ norm, was studied in [33]. The work in [14] considers the more general context of atomic norms, and its RE condition analysis applies to any spherical cap $A$, with sample complexity results in terms of $w(A)$, the Gaussian width of $A$. However, the analysis relies on Gordon's inequality [19, 20, 24], which is applicable only for isotropic Gaussian design matrices. Progress has been made on establishing RE conditions for sub-Gaussian designs for error sets/caps corresponding to specific norms such as $L_1$ ll4^. In recent work, RE conditions were developed for anisotropic sub-Gaussian designs for the $L_1$ norm [34]. Further, recent work has pointed out the differences between the RE and the RIP condition, which gives a two-sided bound on quadratic forms of random matrices [29]. In particular, while the RE condition is sufficient for structured estimation, the RIP results are stronger and may have higher sample complexity.

In the following, we establish the stronger RIP results for any spherical cap $A$ and any sub-Gaussian design matrix, handling the isotropic and anisotropic cases separately. The special case of Gaussian design matrices is automatically covered by the sub-Gaussian results, and results such as Gordon's inequality can be viewed as a special case. All results are in terms of $w(A)$, the Gaussian width of $A$, even for sub-Gaussian designs. In fact, all existing RE results do implicitly have the width term, but in a form specific to the chosen norm [33, 34]. The analysis on atomic norms in [14] has the $w(A)$ term explicitly, but the analysis relies on Gordon's inequality [19, 20, 24], which is applicable only for isotropic Gaussian design matrices.

The proof technique we use is an application of generic chaining [37, 38]. The specific form we utilize was originally developed in ||2Tl|28l. The main idea is to pose the RIP condition as a bound on the supremum of a suitable stochastic process, so that generic chaining can be invoked to obtain a bound. The key difference between our RIP analysis and much of the existing literature on RE conditions, which uses specialized tools such as Gaussian comparison principles ll^ |29]| or analysis geared to a particular norm ll^, is the use of generic chaining, which simplifies the analysis considerably. Further, the RIP results can be viewed as a generalization of the celebrated Johnson-Lindenstrauss (JL) lemma ifT/l, and the interested reader can explore these connections in fTDi.

Isotropic Sub-Gaussian Designs: We first consider the case where the design matrix $X \in \mathbb{R}^{n \times p}$ has independent sub-Gaussian rows, where each row satisfies $|||X_i|||_{\psi_2} \le \kappa$ and $E[X_i X_i^T] = I_{p \times p}$. Thus, the measure $\mu$ from which the rows $X_i$ are sampled independently is an isotropic sub-Gaussian measure.


Theorem 5 Let $X$ be a design matrix with independent isotropic sub-Gaussian rows, i.e., $|||X_i|||_{\psi_2} \le \kappa$ and $E[X_i X_i^T] = I_{p \times p}$. Then, for absolute constants $\eta, c > 0$, with probability at least $(1 - 2\exp(-\eta w^2(A)))$, we have

$$\sup_{u \in A} \left| \frac{1}{n}\|Xu\|_2^2 - 1 \right| = \sup_{u \in A} \left| \frac{1}{n}\sum_{i=1}^n \langle X_i, u\rangle^2 - 1 \right| \le c\,\frac{w(A)}{\sqrt{n}} , \tag{42}$$

or, equivalently,

$$1 - c\,\frac{w(A)}{\sqrt{n}} \le \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le 1 + c\,\frac{w(A)}{\sqrt{n}} . \tag{43}$$


As a result, for $n > c^2 w^2(A)$, the RE condition $\inf_{u \in A} \frac{1}{n}\|Xu\|_2^2 \ge 1 - c\,w(A)/\sqrt{n} > 0$ is satisfied with high probability for any sub-Gaussian design matrix. More generally, choosing $\epsilon = c\,w(A)/\sqrt{n}$, one can write the result in a traditional RIP form [10].
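The $1/\sqrt{n}$ decay of the deviation in (42) can be illustrated with a small simulation. The sketch below is only a rough illustration under stated assumptions: it uses a standard Gaussian design, and it replaces the supremum over a spherical cap by a maximum over a finite random sample of $s$-sparse unit vectors, so it only lower-bounds the true supremum. The helper names and all sizes are hypothetical.

```python
import numpy as np


def sparse_cap_sample(p, s, m, rng):
    """Draw m random s-sparse unit vectors in R^p (a finite subset of the sphere)."""
    U = np.zeros((m, p))
    for j in range(m):
        support = rng.choice(p, size=s, replace=False)
        U[j, support] = rng.standard_normal(s)
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return U


def max_rip_deviation(X, U):
    """Maximum over rows u of U of | (1/n) * ||X u||_2^2 - 1 |."""
    n = X.shape[0]
    q = np.sum((X @ U.T) ** 2, axis=0) / n
    return float(np.max(np.abs(q - 1.0)))


rng = np.random.default_rng(0)
p, s, m = 200, 5, 500
U = sparse_cap_sample(p, s, m, rng)

deviations = {}
for n in (100, 400, 1600, 6400):
    X = rng.standard_normal((n, p))   # isotropic Gaussian rows
    deviations[n] = max_rip_deviation(X, U)
    print(n, deviations[n])
```

Under these assumptions, the printed deviation should shrink as $n$ grows, in line with the $w(A)/\sqrt{n}$ scaling; the simulation says nothing about the constants $c$ and $\eta$.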


Anisotropic Sub-Gaussian Designs: We now consider the case where the design matrix $X \in \mathbb{R}^{n \times p}$ has independent rows, and each row $X_i$ is anisotropic sub-Gaussian with $E[X_i^T X_i] = \Sigma$. Further, we assume that the corresponding isotropic random vector $\tilde{X}_i = X_i \Sigma^{-1/2}$ satisfies $|||\tilde{X}_i|||_{\psi_2} \le \kappa$. A simple special case of such an anisotropic sub-Gaussian design is when $X_i \sim N(0, \Sigma)$, where $\tilde{X}_i = X_i \Sigma^{-1/2} \sim N(0, I)$ so that $|||\tilde{X}_i|||_{\psi_2} = 1$. The result below characterizes the RIP-style property of any such anisotropic sub-Gaussian design.


Theorem 6 Let $X$ be a design matrix with independent anisotropic sub-Gaussian rows, i.e., $E[X_i^T X_i] = \Sigma$ and $|||X_i \Sigma^{-1/2}|||_{\psi_2} \le \kappa$. Then, for absolute constants $\eta, c > 0$, with probability at least $(1 - 2\exp(-\eta w^2(A)))$, we have

$$\sup_{u \in A} \left| \frac{1}{n}\frac{1}{u^T \Sigma u}\|Xu\|_2^2 - 1 \right| = \sup_{u \in A} \left| \frac{1}{n}\frac{1}{u^T \Sigma u}\sum_{i=1}^n \langle X_i, u\rangle^2 - 1 \right| \le c\,\frac{w(A)}{\sqrt{n}} . \tag{44}$$

Further,

$$\lambda_{\min}(\Sigma|A)\left(1 - c\,\frac{w(A)}{\sqrt{n}}\right) \le \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \lambda_{\max}(\Sigma|A)\left(1 + c\,\frac{w(A)}{\sqrt{n}}\right) , \tag{45}$$

where

$$\lambda_{\min}(\Sigma|A) = \inf_{u \in A} u^T \Sigma u , \quad \text{and} \quad \lambda_{\max}(\Sigma|A) = \sup_{u \in A} u^T \Sigma u \tag{46}$$

are the restricted minimum and maximum eigenvalues of $\Sigma$ restricted to $A \subseteq S^{p-1}$.


Thus, for the anisotropic case, the RIP is with respect to the restricted minimum and maximum eigenvalues corresponding to $A \subseteq S^{p-1}$. For the special case when $A = S^{p-1}$, we have $\lambda_{\min}(\Sigma|A) = \lambda_{\min}(\Sigma)$, the minimum eigenvalue, and $\lambda_{\max}(\Sigma|A) = \lambda_{\max}(\Sigma)$, the maximum eigenvalue of $\Sigma$. Further, when $\Sigma = I$, we get back the result in Theorem 5. Finally, it is instructive to compare the above result to existing characterizations of the RE condition for anisotropic Gaussian [33] and anisotropic sub-Gaussian [34] designs, focused on the $L_1$ norm. The result in Theorem 6 is a more general RIP result, applies to any spherical cap $A$, and is in terms of $w(A)$, the Gaussian width of the spherical cap $A$.


5 Examples and Applications 


In this section, we give examples of the analysis from previous sections for three norms: the $L_1$ norm, the group sparse norm, and the $L_2$ norm. The summary of the results is given in Table 1. Other examples can be constructed
for norms and error sets with known bounds on the Gaussian widths and norm compatibility constants lIT^ 
5^ . and more general ways to bound the Gaussian widths and norm compatibility constants have been 
developed in 1251 [T^ . 

$L_1$ Norm: Assume that the statistical parameter $\theta^*$ is $s$-sparse, and note that $\|\theta^*\|_1 \le \sqrt{s}\,\|\theta^*\|_2$. Since the $L_1$ norm is a decomposable norm, following the result in [29], we have $\Psi(E_r) \le 4\Psi(\mathcal{M}) = 4\sqrt{s}$.

Applying Lemma 3, let $\theta$ be a 1-sparse vector, and $\rho(\theta) = 2$, then $w(\Omega_R)$ can be bounded by

$$w(\Omega_R) \le \inf_{\theta \in \Theta_R} w(G(\theta)) = w(G(\theta)) \overset{(a)}{=} O\left(\sqrt{\log p}\right) , \tag{47}$$

where (a) is obtained from the fact that the Gaussian width of $G(\theta)$, with $\theta$ being an $s$-sparse vector, is $\sqrt{2s\log(p/s) + \frac{5}{4}s}$. See the corresponding figure for more details. From Theorem 3 and (47), the bound on $\lambda_n$ is

$$\lambda_n \le c\,\frac{w(\Omega_R)}{\sqrt{n}} = O\left(\sqrt{\frac{\log p}{n}}\right) . \tag{48}$$

Hence, the recovery error is bounded by

$$\|\Delta_n\|_2 \le c_3\,\frac{\Psi(E_r)\lambda_n}{\kappa} = O\left(\sqrt{\frac{s\log p}{n}}\right) , \tag{49}$$

which is similar to well-known results [14, 29].

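Group Sparse Norm: Suppose that the index set $\{1, 2, \ldots, p\}$ can be partitioned into a set of $T$ disjoint groups, say $\mathcal{G} = \{G_1, G_2, \ldots, G_T\}$. Define the $(1, \nu)$-group norm for a given vector $\nu = (\nu_1, \ldots, \nu_T) \in [1, \infty]^T$ as

$$\|\theta\|_{\mathcal{G},\nu} = \sum_{t=1}^T \|\theta_{G_t}\|_{\nu_t} . \tag{50}$$

As shown in [29], the group norm is a decomposable norm. For a given subset $S_G \subseteq \{1, \ldots, T\}$ with cardinality $s_G = |S_G|$, define the subspace $A(S_G) = \{\theta \in \mathbb{R}^p \mid \theta_{G_t} = 0, \ \forall t \notin S_G\}$. Let $\nu_t \ge 2$, then for $\theta \in A(S_G)$ we have

$$\|\theta\|_{\mathcal{G},\nu} = \sum_{t \in S_G} \|\theta_{G_t}\|_{\nu_t} \le \sum_{t \in S_G} \|\theta_{G_t}\|_2 \le \sqrt{s_G}\,\|\theta\|_2 . \tag{51}$$

Hence, from (30) and (51) we have

$$\Psi(E_r) \le 4\sqrt{s_G} . \tag{52}$$

Applying Lemma 3, define $\theta$ with 1 active group, and $\rho(\theta) = 2$, then $w(\Omega_R)$ can be bounded by

$$w(\Omega_R) \le \inf_{\theta \in \Theta_R} w(G(\theta)) = w(G(\theta)) \overset{(a)}{=} O\left(\sqrt{m + \log T}\right) , \tag{53}$$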


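| $R(u)$ | $\lambda_n$ | $\kappa$ | $\Psi(E_r)$ | $\lVert\Delta_n\rVert_2 := c_3 \Psi(E_r)\lambda_n/\kappa$ |
|---|---|---|---|---|
| $L_1$ | $O\left(\sqrt{\log p / n}\right)$ | $O(1)$ if $n > c_2 w^2(A) = O(s \log p)$ | $\sqrt{s}$ | $O\left(\sqrt{s \log p / n}\right)$ |
| Group sparse | $O\left(\sqrt{(m + \log T)/n}\right)$ | $O(1)$ if $n > c_2 w^2(A) = O(s_G(m + \log T))$ | $\sqrt{s_G}$ | $O\left(\sqrt{s_G(m + \log T)/n}\right)$ |
| $L_2$ | $O\left(\sqrt{p/n}\right)$ | $O(1)$ if $n > c_2 w^2(A) = O(p)$ | $1$ | $O\left(\sqrt{p/n}\right)$ |

Table 1: A summary of values for the regularization parameter $\lambda_n$, the RE condition constant $\kappa$, the norm constant $\Psi(E_r)$, and the recovery bound $\|\Delta_n\|_2$ for the $L_1$, $L_2$, and group sparse norms in the case of a Gaussian design matrix with Gaussian noise. All results are given up to constants, with more emphasis on the scale of the results.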


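where $m = \max_t |G_t|$ and (a) is obtained from the fact that the Gaussian width of $G(\theta)$, where $\theta$ has $k$ active groups, is $\sqrt{2k(m + \log(T - k)) + k}$ iflAl. From Theorem 3 and (53), the bound on $\lambda_n$ is

$$\lambda_n \le c\,\frac{w(\Omega_R)}{\sqrt{n}} = O\left(\sqrt{\frac{m + \log T}{n}}\right) . \tag{54}$$

Hence, the recovery error is bounded by

$$\|\Delta_n\|_2 \le c_3\,\frac{\Psi(E_r)\lambda_n}{\kappa} = O\left(\sqrt{\frac{s_G(m + \log T)}{n}}\right) , \tag{55}$$

which is similar to the results obtained in previous works [14, 29].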

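$L_2$ Norm: With the $L_2$ norm as the regularizer, the norm constant is obtained as

$$\Psi(E_r) = \sup_{\Delta \in E_r} \frac{\|\Delta\|_2}{\|\Delta\|_2} = 1 . \tag{56}$$

Applying Lemma 3, set $\rho(\theta) = 1$, then $w(\Omega_R)$ can be bounded by

$$w(\Omega_R) \le \inf_{\theta \in \Theta_R} w(G(\theta)) = O\left(\sqrt{p}\right) . \tag{57}$$

From Theorem 3 and (57), the bound on $\lambda_n$ is

$$\lambda_n \le c\,\frac{w(\Omega_R)}{\sqrt{n}} = O\left(\sqrt{\frac{p}{n}}\right) . \tag{58}$$

Hence, the recovery error is bounded by

$$\|\Delta_n\|_2 \le c_3\,\frac{\Psi(E_r)\lambda_n}{\kappa} = O\left(\sqrt{\frac{p}{n}}\right) . \tag{59}$$
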
6 Generalized Linear Models: Restricted Strong Convexity


In this section, we extend our results to estimation with norm regularization in the context of generalized linear models (GLMs) HUl. Assume that the conditional distribution of the response $y_i$ conditioned on the covariates $X_i$ is an exponential family distribution:

$$p(y_i | X_i; \theta^*) = p(y_i | \langle X_i, \theta^*\rangle) = \exp\{ y_i \langle X_i, \theta^*\rangle - \varphi(\langle X_i, \theta^*\rangle) \} , \tag{60}$$

where

$$\varphi(\langle X_i, \theta^*\rangle) = \log\left( \int_{y_i} \exp\{ y_i \langle X_i, \theta^*\rangle \}\, dy_i \right) \tag{61}$$

is the log-partition function |l8l|4l|H|.^1 In GLMs, the conditional distribution of the response $y_i$ is characterized by an exponential family distribution $p(y_i | \eta_i)$ with natural parameter $\eta_i = \langle X_i, \theta^*\rangle$ determined by the covariates $X_i$ and the parameter $\theta^*$. It is easy to verify that the gradient of the log-partition function with respect to the natural parameter $\eta_i = \langle X_i, \theta^*\rangle$ gives the expectation of the response HHSl, i.e.,

$$\nabla_{\eta_i}\varphi(\eta_i) = \nabla_{\langle X_i, \theta^*\rangle}\varphi(\langle X_i, \theta^*\rangle) = E[y_i | \langle X_i, \theta^*\rangle] . \tag{62}$$

For estimating $\theta^*$, the loss function corresponding to GLMs is typically the negative log-likelihood of the conditional distribution:

$$\mathcal{L}(\theta; Z^n) = -\frac{1}{n}\sum_{i=1}^n \log p(y_i | X_i; \theta) = \frac{1}{n}\sum_{i=1}^n \left( \varphi(\langle X_i, \theta\rangle) - y_i \langle X_i, \theta\rangle \right) . \tag{63}$$

In the current context, we assume $\theta^*$ to be sparse/structured, and the structure can be suitably captured by a norm $R(\cdot)$. Then, the estimation of $\theta^*$ with norm regularization takes the form:

$$\hat{\theta}_{\lambda_n} = \underset{\theta \in \mathbb{R}^p}{\mathrm{argmin}}\ \mathcal{L}(\theta; Z^n) + \lambda_n R(\theta) = \underset{\theta \in \mathbb{R}^p}{\mathrm{argmin}}\ \frac{1}{n}\sum_{i=1}^n \left( \varphi(\langle X_i, \theta\rangle) - y_i \langle X_i, \theta\rangle \right) + \lambda_n R(\theta) . \tag{64}$$

Noise in the context of GLMs is simply the deviation of a specific response $y_i$ from the conditional mean, i.e., $\omega_i = E[y_i | X_i] - y_i$. Popular examples of GLMs come from suitable choices of the conditional distribution, e.g., when $p(y_i | \langle X_i, \theta\rangle)$ is Gaussian so that $\varphi(\langle X_i, \theta\rangle) = \frac{1}{2}\langle X_i, \theta\rangle^2$, Bernoulli so that $\varphi(\langle X_i, \theta\rangle) = \log(1 + \exp(\langle X_i, \theta\rangle))$, and Poisson where $\varphi(\langle X_i, \theta\rangle) = \exp(\langle X_i, \theta\rangle)$, respectively yielding least squares regression, logistic regression, and Poisson regression loss functions. Next, we provide the key results needed to characterize the regularization parameter $\lambda_n$ and restricted strong convexity (RSC) in the context of GLMs. The non-asymptotic bound on the estimation error then follows from the general error bound established earlier.
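To make the GLM loss in (63) concrete, the sketch below implements the three log-partition functions named above together with the loss and its gradient, and checks numerically that the derivative of the log-partition function equals the conditional mean, as in (62). It is an illustrative sketch only: the Gaussian case assumes unit variance, and the dictionary and function names are hypothetical choices for this example.

```python
import numpy as np

# Log-partition functions phi(eta) for three exponential families
# (Gaussian with unit variance is an assumption of this sketch).
LOG_PARTITION = {
    "gaussian": lambda eta: 0.5 * eta ** 2,
    "bernoulli": lambda eta: np.logaddexp(0.0, eta),   # log(1 + exp(eta))
    "poisson": lambda eta: np.exp(eta),
}

# Conditional means E[y | eta], i.e. the derivative of phi.
MEAN = {
    "gaussian": lambda eta: eta,
    "bernoulli": lambda eta: 1.0 / (1.0 + np.exp(-eta)),
    "poisson": lambda eta: np.exp(eta),
}


def glm_loss(theta, X, y, family):
    """Negative log-likelihood (1/n) * sum(phi(<X_i, theta>) - y_i <X_i, theta>)."""
    eta = X @ theta
    return float(np.mean(LOG_PARTITION[family](eta) - y * eta))


def glm_grad(theta, X, y, family):
    """Gradient (1/n) * X^T (E[y | X] - y) of the GLM loss."""
    eta = X @ theta
    return X.T @ (MEAN[family](eta) - y) / X.shape[0]


def numerical_derivative(f, eta, h=1e-6):
    """Central finite-difference approximation of f'(eta)."""
    return (f(eta + h) - f(eta - h)) / (2.0 * h)


etas = np.linspace(-2.0, 2.0, 9)
for family in LOG_PARTITION:
    gap = np.max(np.abs(numerical_derivative(LOG_PARTITION[family], etas)
                        - MEAN[family](etas)))
    print(family, gap)
```

The gradient returned by `glm_grad` has the form $\frac{1}{n}X^T\omega$ with $\omega_i = E[y_i|X_i] - y_i$, which is the structure exploited when bounding the regularization parameter.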

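Bounds on the Regularization Parameter: Following the general analysis, the regularization parameter needs to satisfy the condition $\lambda_n \ge \beta R^*(\nabla_\theta \mathcal{L}(\theta^*; Z^n))$ for any fixed $\beta > 1$. For GLMs,

$$\nabla_\theta \mathcal{L}(\theta^*; Z^n) = -\frac{1}{n}\sum_{i=1}^n y_i X_i + \frac{1}{n}\sum_{i=1}^n X_i \nabla_{\langle X_i, \theta^*\rangle}\varphi(\langle X_i, \theta^*\rangle) = \frac{1}{n}\sum_{i=1}^n X_i \left( E[y_i | X_i] - y_i \right) = \frac{1}{n} X^T \omega , \tag{65}$$

where we have used the fact that $\nabla_{\langle X_i, \theta^*\rangle}\varphi(\langle X_i, \theta^*\rangle) = E[y_i | \langle X_i, \theta^*\rangle]$ and $\omega_i = E[y_i | \langle X_i, \theta^*\rangle] - y_i$. Thus, the form of $\nabla_\theta \mathcal{L}(\theta^*; Z^n)$ is the same as that in Section 3. Assuming the design matrix $X$ and noise $\omega$ are sub-Gaussian, a characterization of $\lambda_n$ follows from Theorems 3 and 4 in Section 3. In particular, $E[R^*(\nabla_\theta \mathcal{L}(\theta^*; Z^n))] = O\left(\frac{w(\Omega_R)}{\sqrt{n}}\right)$, with corresponding high probability concentration results, and it suffices to have $\lambda_n$ be of this order.

^1 Note that for GLMs over discrete responses $y_i$, the integration needs to be suitably changed to a summation.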
Restricted Strong Convexity: By definition, the restricted strong convexity considers

$$\delta\mathcal{L}(u, \theta^*) = \mathcal{L}(\theta^* + u) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), u\rangle = \frac{1}{n}\sum_{i=1}^n \nabla^2\varphi\left( \langle \theta^*, X_i\rangle + \gamma_i \langle u, X_i\rangle \right) \langle u, X_i\rangle^2 ,$$

where $\gamma_i \in [0, 1]$, and where the last equality follows from a direct application of the mean value theorem [35]. Since the log-partition function $\varphi$ is of Legendre type [3, 4, 8], the second derivative $\nabla^2\varphi(\cdot)$ is always positive. Since the RSC condition relies on a non-trivial lower bound for the above quantity, the analysis will consider suitable compact sets where $\nabla^2\varphi(\cdot)$ is bounded away from zero by a constant. In particular, for a suitable constant $T$, we consider the sets $\{X_i : |\langle X_i, \theta^*\rangle| \le T\}$ and $\{X_i : |\langle X_i, u\rangle| \le T\}$. For $X_i$ lying in these sets, the argument $a = \langle X_i, \theta^*\rangle + \gamma_i \langle u, X_i\rangle$ of the second derivative satisfies $|a| \le 2T$, which is the compact set of interest. Within the compact set, $\ell = \ell_\varphi(T) = \min_{|a| \le 2T} \nabla^2\varphi(a)$ is bounded away from zero. Outside the compact set, we will only assume $\nabla^2\varphi(\cdot) > 0$. Based on the above construction, we have

$$\delta\mathcal{L}(u, \theta^*) \ge \frac{\ell}{n}\sum_{i=1}^n \langle X_i, u\rangle^2\, \mathbb{I}[|\langle X_i, \theta^*\rangle| \le T]\, \mathbb{I}[|\langle X_i, u\rangle| \le T] . \tag{66}$$

The quadratic-form lower bound allows us to establish RSC conditions for GLMs with isotropic sub-Gaussian design matrices by building on the results from Section 4 for RE conditions for squared loss. As a result, the sample complexity of the RSC condition is also expressed in terms of the Gaussian width of the spherical cap $A$ derived from the error set. The analysis can be suitably generalized to anisotropic design matrices using techniques discussed in Section 4.

As before, we consider $u \in A \subseteq S^{p-1}$ so that $\|u\|_2 = 1$. Assuming $X$ has isotropic sub-Gaussian rows with $|||X_i|||_{\psi_2} \le \kappa$, the $\langle X_i, u\rangle$ are sub-Gaussian random variables with sub-Gaussian norm at most $C\kappa$ 1421. Denote by $\epsilon_1$ and $\epsilon_2$ the probabilities that $\langle X_i, u\rangle$ and $\langle X_i, \theta^*\rangle$ exceed some constant $T$, i.e., $\epsilon_1(T; u) = P\{|\langle X_i, u\rangle| > T\} \le e \cdot \exp(-c_2 T^2 / C^2\kappa^2) = \epsilon_1$, and $\epsilon_2(T; \theta^*) = P\{|\langle X_i, \theta^*\rangle| > T\} \le e \cdot \exp(-c_2 T^2 / C^2\kappa^2) = \epsilon_2$, where $\epsilon_1 = \epsilon_1(T; \kappa)$ and $\epsilon_2 = \epsilon_2(T; \kappa)$ are uniform upper bounds on the individual tail probabilities. The result we present below is in terms of the above defined constants $\ell = \ell_\varphi(T)$, $\epsilon_1 = \epsilon_1(T; \kappa)$ and $\epsilon_2 = \epsilon_2(T; \kappa)$ for any suitably chosen $T$.

Theorem 7 Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with independent isotropic sub-Gaussian rows such that $|||X_i|||_{\psi_2} \le \kappa$. Then, for any set $A \subseteq S^{p-1}$, for suitable constants $\eta, c > 0$, with probability at least $1 - 2\exp(-\eta w^2(A))$, we have

inf dC{u, 9*) > £p^ f 1 — c/tf ^ ^ (- 57 ^ 

U&A - \ yjn ) 

where ff = inf^gAP^ with = E[{Xi,u)H[\{Xi,9*)\ < r]I[|(A*, tt)| < T]], and ki = 

The form of the result is closely related to the corresponding result for the RE condition on $\inf_{u \in A} \frac{1}{n}\|Xu\|_2^2$ considered in Section 4. Note that RSC analysis for GLMs was considered in ||2^ for specific norms, especially $L_1$, whereas our analysis applies to any set $A \subseteq S^{p-1}$, hence to any norm, and the result is in terms of the Gaussian width $w(A)$ of $A$. Further, following the arguments in Section 4, the RSC analysis for GLMs can be extended to anisotropic sub-Gaussian design matrices.


7 Conclusions 

The paper presents a general set of results and tools for characterizing non-asymptotic estimation error in norm regularized regression problems. The analysis holds for any norm, and subsumes much of the existing literature focused on structured sparsity and related themes. The work can be viewed as a direct generalization of the results in [29], which presented related results for decomposable norms. Our analysis illustrates the important role that Gaussian widths, as a measure of the size of suitable sets, play in such results. Further, the error sets for regularized and constrained versions of such problems are shown to be closely related.

While the paper presents a unified geometric treatment of non-asymptotic structured estimation with regularized estimators, several technical questions need further investigation. The focus of the analysis has been on thin-tailed distributions, and the RE/RSC type analysis presented really gives two-sided bounds, i.e., RIP, showing that thin-tailed distributions do satisfy the RIP condition. For heavy-tailed measurements, the lower and upper tails of quadratic forms behave differently ll^l27ll, and it may be possible to establish geometric estimation error analysis for general norms, some special cases of which have been investigated in recent years Il22l l23l ITTll. Further, the sample complexity of the phase transitions in the RE/RSC conditions for anisotropic designs depends on the largest eigenvalue (operator norm) of the covariance matrix, making the estimator sample inefficient for highly correlated designs. Since several real-world problems, including spatial and temporal problems, do have correlated observations, it will be important to investigate estimators which perform well in such settings lITSl. Finally, the focus of the work is on parametric estimation, and it will be interesting to explore generalizations of the analysis to non-parametric settings.


Appendix 

A Background and Preliminaries 

We start with a review of some definitions and well-known results which will be used for our proofs. 


A.1 Gaussian Width

In several of our proofs, we use the concept of Gaussian width [20, 14], which is defined as follows.

Definition 1 (Gaussian width) For any set $A \subseteq \mathbb{R}^p$, the Gaussian width of the set $A$ is defined as:

$$w(A) = E_g\left[ \sup_{u \in A} \langle g, u\rangle \right] , \tag{68}$$

where the expectation is over $g \sim N(0, I_{p \times p})$, a vector of independent zero-mean unit-variance Gaussian random variables.


The Gaussian width $w(A)$ provides a geometric characterization of the size of the set $A$. We consider three perspectives of the Gaussian width, and provide some properties which are used in our analysis. First, consider the Gaussian process $\{Z_u\}$ where the constituent Gaussian random variables $Z_u = \langle u, g\rangle$ are indexed by $u \in A$, and $g \sim N(0, I_{p \times p})$. Then the Gaussian width $w(A)$ can be viewed as the expectation of the supremum of the Gaussian process $\{Z_u\}$. Bounds on the expectations of Gaussian and other empirical processes have been widely studied in the literature, and we will make use of generic chaining for some of our analysis llT/l 1^ 17112411. Second, $\langle u, g\rangle$ can be viewed as a Gaussian random projection of each $u \in A$ to one dimension, and the Gaussian width simply measures the expectation of the largest value of such projections. Third, if $A$ is the unit ball of any norm $R(\cdot)$, i.e., $A = \{x \in \mathbb{R}^p \mid R(x) \le 1\}$, then $w(A) = E_g[R^*(g)]$ by definition of the dual norm. Thus, the Gaussian width is the expected value of the dual norm of a standard Gaussian random vector. For instance, if $A$ is the unit ball of the $L_1$ norm, $w(A) = E[\|g\|_\infty]$.
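The dual-norm characterization above suggests a simple Monte Carlo estimate of the Gaussian width of a unit norm ball: average the dual norm over standard Gaussian draws. The following is a minimal sketch; the function name `gaussian_width_unit_ball`, the sample size, and the dimension are illustrative assumptions.

```python
import numpy as np


def gaussian_width_unit_ball(dual_norm, p, num_samples=2000, seed=0):
    """Monte Carlo estimate of w(A) = E[R*(g)] for the unit ball A of a norm R.

    dual_norm: callable computing the dual norm R*(g) of a vector g.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((num_samples, p))
    return float(np.mean([dual_norm(g) for g in G]))


p = 1000
# Dual of the L1 norm is the L-infinity norm.
w_l1 = gaussian_width_unit_ball(lambda g: np.max(np.abs(g)), p)
# The L2 norm is its own dual.
w_l2 = gaussian_width_unit_ball(lambda g: np.linalg.norm(g), p)

print(w_l1, np.sqrt(2.0 * np.log(p)))   # width of the L1 ball vs. sqrt(2 log p)
print(w_l2, np.sqrt(p))                 # width of the L2 ball vs. sqrt(p)
```

The estimates should track the $O(\sqrt{\log p})$ and $O(\sqrt{p})$ scalings used for the $L_1$ and $L_2$ examples, up to Monte Carlo error.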

Below we list some simple and useful properties of the Gaussian width of $A \subseteq \mathbb{R}^p$:

Property 1: $w(A) \le w(B)$ for $A \subseteq B$.

Property 2: $w(A) = w(\mathrm{conv}(A))$, where $\mathrm{conv}(\cdot)$ denotes the convex hull of $A$.

Property 3: $w(cA) = c\,w(A)$ for any positive scalar $c$, in which $cA = \{cx \mid x \in A\}$.

Property 4: $w(\Gamma A) = w(A)$ for any orthogonal matrix $\Gamma \in \mathbb{R}^{p \times p}$.

Property 5: $w(A + b) = w(A)$ for any $A \subseteq \mathbb{R}^p$ and fixed $b \in \mathbb{R}^p$.

The last two properties illustrate that the Gaussian width is rotation and translation invariant.

A.2 Sub-Gaussian and Sub-exponential Random Variables (Vectors)

In the proofs, we will also frequently use the properties of sub-Gaussian and sub-exponential random variables (vectors). In particular, we are interested in their definitions using moments.

Definition 2 Sub-Gaussian (sub-exponential) random variable: We say that a random variable $x$ is sub-Gaussian (sub-exponential) if its moments satisfy

$$\left[ E|x|^p \right]^{\frac{1}{p}} \le K_2 \sqrt{p} \qquad \left( \left[ E|x|^p \right]^{\frac{1}{p}} \le K_1 p \right) \tag{69}$$

for any $p \ge 1$ with a constant $K_2$ ($K_1$). The minimum value of $K_2$ ($K_1$) is called the sub-Gaussian (sub-exponential) norm of $x$, denoted by $|||x|||_{\psi_2}$ ($|||x|||_{\psi_1}$).

Definition 3 Sub-Gaussian (sub-exponential) random vector: We say that a random vector $X$ in $\mathbb{R}^n$ is sub-Gaussian (sub-exponential) if the one-dimensional marginals $\langle X, x\rangle$ are sub-Gaussian (sub-exponential) random variables for all $x \in \mathbb{R}^n$. The sub-Gaussian (sub-exponential) norm of $X$ is defined as

$$|||X|||_{\psi_2} = \sup_{x \in S^{n-1}} |||\langle X, x\rangle|||_{\psi_2} \qquad \left( |||X|||_{\psi_1} = \sup_{x \in S^{n-1}} |||\langle X, x\rangle|||_{\psi_1} \right) . \tag{70}$$


The following definitions and lemmas are from BTI .


Lemma 4 Consider a finite number of independent centered sub-Gaussian random variables $X_i$. Then $\sum_i X_i$ is also a centered sub-Gaussian random variable. Moreover,

$$\Big|\Big|\Big| \sum_i X_i \Big|\Big|\Big|_{\psi_2}^2 \le C \sum_i |||X_i|||_{\psi_2}^2 , \tag{71}$$

where $C$ is an absolute constant.


Lemma 5 Let $X_1, \ldots, X_n$ be independent centered sub-Gaussian random variables. Then $X = (X_1, \ldots, X_n)$ is a centered sub-Gaussian random vector in $\mathbb{R}^n$, and

$$|||X|||_{\psi_2} \le C \max_{i \le n} |||X_i|||_{\psi_2} , \tag{72}$$

where $C$ is an absolute constant.

Lemma 6 Consider a sub-Gaussian random vector $X$ with sub-Gaussian norm $K = \max_i |||X_i|||_{\psi_2}$. Then $Z = \langle X, a\rangle$ is a sub-Gaussian random variable with sub-Gaussian norm $|||Z|||_{\psi_2} \le CK\|a\|_2$.
Lemma 7 A random variable $X$ is sub-Gaussian if and only if $X^2$ is sub-exponential. Moreover,

$$|||X|||_{\psi_2}^2 \le |||X^2|||_{\psi_1} \le 2\,|||X|||_{\psi_2}^2 . \tag{73}$$

Lemma 8 If $X$ is sub-Gaussian (or sub-exponential), then so is $X - EX$. Moreover, the following holds:

$$|||X - EX|||_{\psi_2} \le 2\,|||X|||_{\psi_2} , \qquad |||X - EX|||_{\psi_1} \le 2\,|||X|||_{\psi_1} . \tag{74}$$


B Restricted Error Set and Recovery Guarantees 

Section 2 is about the restricted error set. Lemma 1 characterizes the restricted error set. Theorem 1 establishes the relation between the constrained and regularized error sets. In particular, we prove that the Gaussian widths of the regularized and constrained error sets (cones) are of the same order. Starting with the assumption that the RSC condition is satisfied, Lemma 2 and Theorem 2 derive upper bounds on the $L_2$ norm of the error.

We collect the proofs of the different results in this section.


B.1 The Restricted Error Set


Lemma 1 in Section 2 characterizes the set to which the error vector belongs. We give the proof of Lemma 1 below:

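Lemma 1 For any $\beta > 1$, assume that

$$\lambda_n \ge \beta R^*(\nabla\mathcal{L}(\theta^*; Z^n)) , \tag{75}$$

where $R^*(\cdot)$ is the dual norm of $R(\cdot)$. Then the error vector $\Delta_n = \hat{\theta}_{\lambda_n} - \theta^*$ belongs to the set:

$$E_r = E_r(\theta^*, \beta) = \left\{ \Delta \in \mathbb{R}^p \ \middle|\ R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta) \right\} . \tag{76}$$

Proof: By the optimality of $\hat{\theta}_{\lambda_n} = \theta^* + \Delta_n$, we have

$$\mathcal{L}(\theta^* + \Delta_n) + \lambda_n R(\theta^* + \Delta_n) - \{ \mathcal{L}(\theta^*) + \lambda_n R(\theta^*) \} \le 0 . \tag{77}$$

Now, since $\mathcal{L}$ is convex,

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge \langle \nabla\mathcal{L}(\theta^*), \Delta\rangle \ge -|\langle \nabla\mathcal{L}(\theta^*), \Delta\rangle| . \tag{78}$$

Further, by the generalized Hölder's inequality, we have

$$|\langle \nabla\mathcal{L}(\theta^*), \Delta\rangle| \le R^*(\nabla\mathcal{L}(\theta^*)) R(\Delta) \le \frac{\lambda_n}{\beta} R(\Delta) , \tag{79}$$

where we have used $\lambda_n \ge \beta R^*(\nabla\mathcal{L}(\theta^*; Z^n))$. Hence, we have

$$\mathcal{L}(\theta^* + \Delta_n) - \mathcal{L}(\theta^*) \ge -\frac{\lambda_n}{\beta} R(\Delta_n) . \tag{80}$$

As a result,

$$\lambda_n \left\{ R(\theta^* + \Delta_n) - R(\theta^*) - \frac{1}{\beta} R(\Delta_n) \right\} \le 0 . \tag{81}$$

Noting that $\lambda_n > 0$ and rearranging completes the proof. ■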
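The statement of the lemma can be checked numerically for the $L_1$ norm. The sketch below is a hedged illustration and not the method of the paper: it solves the Lasso with a plain proximal gradient (ISTA) loop, which is a solver choice made only for this example, sets $\lambda_n = \beta R^*(\nabla\mathcal{L}(\theta^*))$ using the (normally unknown) true parameter, and then tests membership of the error vector in $E_r$. All names, sizes, and the noise level are assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(theta, X, y, lam):
    """(1/n) * ||y - X theta||_2^2 + lam * ||theta||_1."""
    r = y - X @ theta
    return float(r @ r) / X.shape[0] + lam * float(np.sum(np.abs(theta)))


def lasso_ista(X, y, lam, num_iters=3000):
    """Minimize the Lasso objective by proximal gradient descent (ISTA)."""
    n, p = X.shape
    # Step size 1/L, where L = (2/n) * ||X||_op^2 is the gradient's Lipschitz constant.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    theta = np.zeros(p)
    for _ in range(num_iters):
        grad = -2.0 / n * (X.T @ (y - X @ theta))
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta


rng = np.random.default_rng(0)
n, p, s, beta = 100, 50, 3, 2.0
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0
y = X @ theta_star + 0.5 * rng.standard_normal(n)

# lambda_n = beta * R*(grad L(theta*)); the dual norm of L1 is L-infinity.
grad_star = -2.0 / n * (X.T @ (y - X @ theta_star))
lam = beta * float(np.max(np.abs(grad_star)))

theta_hat = lasso_ista(X, y, lam)
delta = theta_hat - theta_star

lhs = float(np.sum(np.abs(theta_star + delta)))                            # R(theta* + Delta)
rhs = float(np.sum(np.abs(theta_star))) + float(np.sum(np.abs(delta))) / beta
print(lhs, rhs, lhs <= rhs)
```

In practice $\theta^*$ is unknown, so $\lambda_n$ cannot be set this way; the example only illustrates that the condition in (75) places the error vector inside the set in (76).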
B.2 Relation between the Constrained and Regularized Error Cones

In this section we show that the sizes of the regularized and constrained error sets are of the same order. Recall from [14] that the error set for the constrained setting for atomic norms is a cone given by:

$$C_c = C_c(\theta^*) = \mathrm{cone}(E_c) = \mathrm{cone}\{ \Delta \in \mathbb{R}^p \mid R(\theta^* + \Delta) \le R(\theta^*) \} . \tag{82}$$

The error set $E_r$ is given by:

$$E_r = E_r(\theta^*, \beta) = \left\{ \Delta \in \mathbb{R}^p \ \middle|\ R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta) \right\} .$$

Below we provide the proof of Theorem 1.

Theorem 1 Let $\bar{A}_c = E_c \cap \rho B_2^p$, $A_r = E_r \cap \rho B_2^p$, and $A_c = C_c \cap \rho B_2^p$, where $\rho B_2^p = \{u \mid \|u\|_2 \le \rho\}$ is the $L_2$ ball of any radius $\rho > 0$. Then, for any $\beta > 1$ we have

$$w(\bar{A}_c) \le w(A_r) \le \left( 1 + \frac{2}{\beta - 1}\frac{\|\theta^*\|_2}{\rho} \right) w(A_c) , \tag{83}$$

where $w(A)$ denotes the Gaussian width of any set $A$, given by $w(A) = E_g\left[\sup_{a \in A} \langle a, g\rangle\right]$, where $g$ is an isotropic Gaussian random vector, i.e., $g \sim N(0, I_{p \times p})$.
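Proof: The first inequality simply follows from the fact that $E_c \subseteq E_r$ and Property 1 of the Gaussian width. For the second part, from the triangle inequality, we have

$$R(\Delta) \le R(\theta^* + \Delta) + R(\theta^*) . \tag{84}$$

Then,

$$\begin{aligned}
E_r(\theta^*, \beta) &= \left\{ \Delta \in \mathbb{R}^p \ \middle|\ R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\Delta) \right\} \\
&\subseteq \left\{ \Delta \in \mathbb{R}^p \ \middle|\ R(\theta^* + \Delta) \le R(\theta^*) + \frac{1}{\beta} R(\theta^* + \Delta) + \frac{1}{\beta} R(\theta^*) \right\} \\
&= \left\{ \Delta \in \mathbb{R}^p \ \middle|\ \left(1 - \frac{1}{\beta}\right) R(\theta^* + \Delta) \le \left(1 + \frac{1}{\beta}\right) R(\theta^*) \right\} \\
&= \left\{ \Delta \in \mathbb{R}^p \ \middle|\ R(\theta^* + \Delta) \le \frac{\beta + 1}{\beta - 1} R(\theta^*) \right\} = \bar{E}_r(\theta^*, \beta) .
\end{aligned}$$

Let $\bar{C}_r(\theta^*, \beta)$ denote the following set:

$$\bar{C}_r = \bar{C}_r(\theta^*, \beta) = \mathrm{cone}\left\{ \Delta - \frac{2}{\beta - 1}\theta^* \ \middle|\ \Delta \in \bar{E}_r \right\} + \frac{2}{\beta - 1}\theta^* . \tag{85}$$

It follows naturally from the construction that $\bar{E}_r \subseteq \bar{C}_r$.

Let $\bar{A}_r = \bar{C}_r(\theta^*, \beta) \cap \rho B_2^p$. Since $E_r(\theta^*, \beta) \subseteq \bar{C}_r(\theta^*, \beta)$, we have $w(A_r) \le w(\bar{A}_r)$. We define two additional sets for our analysis:

$$\tilde{A}_r = \bar{A}_r - \frac{2}{\beta - 1}\theta^* = \left\{ \Delta \in \mathbb{R}^p \ \middle|\ \Delta + \frac{2}{\beta - 1}\theta^* \in \bar{A}_r \right\} , \tag{86}$$

$$D_c = C_c \cap \left( \rho + \frac{2}{\beta - 1}\|\theta^*\|_2 \right) B_2^p = \left\{ \Delta \in \mathbb{R}^p \ \middle|\ \Delta \in C_c,\ \|\Delta\|_2 \le \rho + \frac{2}{\beta - 1}\|\theta^*\|_2 \right\} . \tag{87}$$

Following Property 3 of the Gaussian width, we have

$$w(D_c) = \left( 1 + \frac{2}{\beta - 1}\frac{\|\theta^*\|_2}{\rho} \right) w(A_c) . \tag{88}$$

Further, using Property 5 of the Gaussian width, we have

$$w(\bar{A}_r) = w(\tilde{A}_r) . \tag{89}$$

From the construction it is clear that $\tilde{A}_r \subseteq D_c$. Hence we have

$$w(\tilde{A}_r) \le w(D_c) . \tag{90}$$

Then, we have

$$w(A_r) \le w(\bar{A}_r) = w(\tilde{A}_r) \le w(D_c) = \left( 1 + \frac{2}{\beta - 1}\frac{\|\theta^*\|_2}{\rho} \right) w(A_c) .$$

By noting that $w(\bar{A}_c) \le w(A_r)$, we complete the proof.


Figure 5: Error cone for the $L_1$ norm in two dimensions: (a) The $L_1$ norm ball in two dimensions; (b) The constrained error cone $A_c$; (c) The regularized error cone $A_r$ and the shifted cone $\bar{A}_r$; (d) The constrained error cone $A_c$ and the shifted constrained error cone $D_c$; (e) All error cones.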

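B.3 Recovery Guarantees

Lemma 2 and Theorem 2 in the paper are results which establish recovery guarantees. The result in Lemma 2 depends on $\theta^*$, which is unknown. On the other hand, Theorem 2 gives the result in terms of quantities like $\lambda_n$ and the norm compatibility constant $\Psi(E_r) = \sup_{u \in E_r} \frac{R(u)}{\|u\|_2}$, which are easier to compute or bound. In this section we give proofs of Lemma 2 and Theorem 2.

Lemma 2 Assume that the RSC condition is satisfied in $E_r$ by the loss $\mathcal{L}(\cdot)$ with parameter $\kappa$. With $\Delta_n = \hat{\theta}_{\lambda_n} - \theta^*$, for any norm $R(\cdot)$, we have

$$\|\Delta_n\|_2 \le \frac{1}{\kappa}\|\nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2 , \tag{91}$$

where $\nabla R(\cdot)$ is any sub-gradient of the norm $R(\cdot)$.

Proof: By the RSC property in $E_r$, for any $\Delta \in E_r$ we have

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge \langle \nabla\mathcal{L}(\theta^*), \Delta\rangle + \kappa\|\Delta\|_2^2 . \tag{92}$$

Also, recall that any norm is convex, since by the triangle inequality, for $t \in [0, 1]$, we have

$$R(t\theta_1 + (1 - t)\theta_2) \le R(t\theta_1) + R((1 - t)\theta_2) = tR(\theta_1) + (1 - t)R(\theta_2) . \tag{93}$$

As a result, for any sub-gradient $\nabla R(\theta^*)$ of $R$ at $\theta^*$, we have

$$R(\theta^* + \Delta) - R(\theta^*) \ge \langle \Delta, \nabla R(\theta^*)\rangle . \tag{94}$$

Adding (92) and $\lambda_n$ times (94), we get

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n (R(\theta^* + \Delta) - R(\theta^*)) \ge \langle \nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*), \Delta\rangle + \kappa\|\Delta\|_2^2 . \tag{95}$$

Now, by the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
|\langle \nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*), \Delta\rangle| &\le \|\nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2 \|\Delta\|_2 \\
\Rightarrow \quad \langle \nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*), \Delta\rangle &\ge -\|\nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2 \|\Delta\|_2 .
\end{aligned} \tag{96}$$

Using (96) in (95), we have

$$\begin{aligned}
\mathcal{F}(\Delta) &= \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n (R(\theta^* + \Delta) - R(\theta^*)) \\
&\ge -\|\nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2 \|\Delta\|_2 + \kappa\|\Delta\|_2^2 \\
&= \kappa\|\Delta\|_2 \left\{ \|\Delta\|_2 - \frac{\|\nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2}{\kappa} \right\} .
\end{aligned} \tag{97}$$

Now, since $\mathcal{F}(\Delta_n) \le 0$, from (97), we have

$$\|\Delta_n\|_2 \le \frac{\|\nabla\mathcal{L}(\theta^*) + \lambda_n \nabla R(\theta^*)\|_2}{\kappa} , \tag{98}$$

which completes the proof.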


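Theorem 2 Assume that the RSC condition is satisfied in $E_r$ by the loss $\mathcal{L}(\cdot)$ with parameter $\kappa$. With $\Delta_n = \hat{\theta}_{\lambda_n} - \theta^*$, for any norm $R(\cdot)$, we have

$$\|\Delta_n\|_2 \le \frac{1 + \beta}{\beta}\frac{\lambda_n}{\kappa}\Psi(E_r) . \tag{99}$$

Proof: By the RSC property in $E_r$, we have for any $\Delta \in E_r$

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge \langle \nabla\mathcal{L}(\theta^*), \Delta\rangle + \kappa\|\Delta\|_2^2 . \tag{100}$$

By definition of a dual norm, we have

$$|\langle \nabla\mathcal{L}(\theta^*), \Delta\rangle| \le R^*(\nabla\mathcal{L}(\theta^*)) R(\Delta) . \tag{101}$$

Further, by construction, $R^*(\nabla\mathcal{L}(\theta^*)) \le \frac{\lambda_n}{\beta}$, implying

$$|\langle \nabla\mathcal{L}(\theta^*), \Delta\rangle| \le \frac{\lambda_n}{\beta} R(\Delta) \quad \Rightarrow \quad \langle \nabla\mathcal{L}(\theta^*), \Delta\rangle \ge -\frac{\lambda_n}{\beta} R(\Delta) . \tag{102}$$

Further, from the triangle inequality, we have

$$R(\theta^* + \Delta) - R(\theta^*) \ge -R(\Delta) . \tag{103}$$

Combining (100), (102), and $\lambda_n$ times (103), we have

$$\begin{aligned}
\mathcal{F}(\Delta) &= \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n (R(\theta^* + \Delta) - R(\theta^*)) \ge -\frac{\lambda_n}{\beta} R(\Delta) + \kappa\|\Delta\|_2^2 - \lambda_n R(\Delta) \\
&= \kappa\|\Delta\|_2^2 - \lambda_n \left( \frac{1 + \beta}{\beta} \right) R(\Delta) .
\end{aligned} \tag{104}$$

By definition of the norm compatibility constant $\Psi(E_r)$, we have $R(\Delta) \le \|\Delta\|_2 \Psi(E_r)$, implying $-R(\Delta) \ge -\|\Delta\|_2 \Psi(E_r)$. Plugging the inequality back into (104), we have

$$\mathcal{F}(\Delta) \ge \kappa\|\Delta\|_2 \left\{ \|\Delta\|_2 - \frac{1 + \beta}{\beta}\frac{\lambda_n}{\kappa}\Psi(E_r) \right\} . \tag{105}$$

Since $\mathcal{F}(\Delta_n) \le 0$, we have

$$\|\Delta_n\|_2 \le \frac{1 + \beta}{\beta}\frac{\lambda_n}{\kappa}\Psi(E_r) , \tag{106}$$

which completes the proof.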

C Bounds on the Regularization Parameter

In this section, we prove Theorems 3 and 4 in Section 3 of the paper. The regularization parameter should satisfy the condition $\lambda_n \ge \beta R^*(\nabla\mathcal{L}(\theta^*; Z^n))$. In Theorem 3 we establish the upper bound on the expectation $E[R^*(\nabla\mathcal{L}(\theta^*; Z^n))]$ in terms of the Gaussian width of the unit norm ball for least squares loss and sub-Gaussian designs. In Theorem 4 we show that $R^*(\nabla\mathcal{L}(\theta^*; Z^n))$ concentrates sharply around its expectation.

C.1 Proof of Theorem 3

To prove Theorem 3, we first need the following theorem from generic chaining.

Theorem 8 Let $\Omega_R = \{u : R(u) \le 1\}$ be the unit norm ball of $R(\cdot)$. Assuming $h$ is any centered sub-Gaussian random vector with $|||h|||_{\psi_2} \le \kappa$, we have

$$E\left[ \sup_{R(u) \le 1} \langle h, u\rangle \right] \le \eta_0 \kappa\, w(\Omega_R) ,$$

where $\eta_0$ is a universal constant.

Proof: The quantity $E[\sup_{R(u) \le 1} \langle h, u\rangle]$ can be considered the "sub-Gaussian width" of $\Omega_R$, the unit norm ball, since it has the exact same form as the Gaussian width, with $h$ being a sub-Gaussian vector instead of a Gaussian vector. Next, we show that the sub-Gaussian width is always bounded by the Gaussian width times a factor proportional to $\kappa$.

Consider the sub-Gaussian process $Y = \{Y_u\}$, $Y_u = \langle u, h\rangle$, indexed by $u \in \Omega_R$, the unit norm ball. Consider the Gaussian process $X = \{X_u\}$, $X_u = \langle u, g\rangle$, where $g \sim N(0, I)$, indexed by the same set, i.e., $u \in \Omega_R$, the unit norm ball. First, note that $|Y_u - Y_v| = |\langle h, u - v\rangle|$, so that by the concentration of sub-Gaussian random variables [41, Equation 5.10], we have

$$P\left( |Y_u - Y_v| \ge \epsilon \right) \le e \cdot \exp\left( -\frac{c\,\epsilon^2}{\kappa^2 \|u - v\|_2^2} \right) , \tag{107}$$

where $c > 0$ is an absolute constant. As a result, a direct application of the generic chaining argument for upper bounds on such empirical processes [37, Theorem 2.1.5] gives

$$E\left[ \sup_{u,v} |Y_u - Y_v| \right] \le \eta_1 \kappa\, E\left[ \sup_u X_u \right] = \eta_1 \kappa\, w(\Omega_R) , \tag{108}$$

where $\eta_1$ is an absolute constant. Further, since $\{Y_u\}$ is a symmetric process, from [37, Lemma 1.2.8], we have

$$E\left[ \sup_{u,v} |Y_u - Y_v| \right] = 2E\left[ \sup_u Y_u \right] . \tag{109}$$

As a result, with $\eta_0 = \eta_1 / 2$, we have

$$E\left[ \sup_{R(u) \le 1} \langle h, u\rangle \right] = E\left[ \sup_u Y_u \right] \le \eta_0 \kappa\, w(\Omega_R) . \tag{110}$$

That completes the proof.
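As a rough illustration of the comparison between the "sub-Gaussian width" and the Gaussian width, the sketch below estimates both for the $L_1$ and $L_2$ unit balls, using a Rademacher (random sign) vector as one particular sub-Gaussian choice of $h$. The choice of Rademacher entries, the helper name `empirical_width`, and the sizes are assumptions made for this example only.

```python
import numpy as np


def empirical_width(sample_vectors, dual_norm):
    """Average of sup_{R(u) <= 1} <h, u> = R*(h) over the rows h of sample_vectors."""
    return float(np.mean([dual_norm(h) for h in sample_vectors]))


rng = np.random.default_rng(0)
p, num_samples = 500, 2000
G = rng.standard_normal((num_samples, p))            # Gaussian vectors g
H = rng.choice([-1.0, 1.0], size=(num_samples, p))   # Rademacher vectors h (sub-Gaussian)


def linf(v):
    return float(np.max(np.abs(v)))   # dual norm of L1


def l2(v):
    return float(np.linalg.norm(v))   # dual norm of L2


gauss_l1, rad_l1 = empirical_width(G, linf), empirical_width(H, linf)
gauss_l2, rad_l2 = empirical_width(G, l2), empirical_width(H, l2)

print("L1 ball:", rad_l1, "vs Gaussian width", gauss_l1)
print("L2 ball:", rad_l2, "vs Gaussian width", gauss_l2)
```

For the $L_1$ ball the Rademacher width equals 1, well below the Gaussian width, while for the $L_2$ ball it equals $\sqrt{p}$, marginally above the Gaussian width; this is consistent with a bound that holds only up to the constant factor $\eta_0\kappa$.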

Now we turn to the proof of Theorem]^ 

Theorem 1^ Let Qji = {u : R{u) < 1}, and C be the squared loss. For sub-Gaussian design X and i 
oj, we have 

E [B*{VC{e*-Z'^))] < , (111) 


noise 


n 

where the expectation is taken over both X and u). The constant ^ is given by 

^ _ f 1 if X is isotropic 

\ A^/A ma ,^(S) if X is anisotropic . 

Proof: For least squares loss, we first note that 


E[R*{X£{e*-,Z^))] = E 

R*{-X^uj) 

= E 

sup {—X'^ui,u) 


n 


R{u)<l R 


= E 


< E 


— IIcjIU • E 
n 


/ yT ^ 

sup ( X -——, u 


R(u)<l 


\ 0 J \\2 


UJ 


— ||a ;||2 • sup E 
'R 'ueS’i-i 


1 


= —E [||tu|| 2 ] • sup E 


sup (x'^VjU) 
R{u)<l ' ' 

sup (x'^v,u\ 

R{u)<l \ ' 


( 112 ) 


£’[||a;|| 2 ] is the expected length of a centered sub-Gaussian random vector, which can be easily bounded 
using Jensen’s inequality, 

F;[||a;||2] < yjE[\\u:\\l]=V^. (113) 


Then we focus on $E\left[\sup_{R(u)\le 1} \langle X^T v, u\rangle\right]$ for any fixed $v \in S^{n-1}$. Let $h = X^T v$ be a random vector, and consider the random variable $\langle h, z\rangle$ for any fixed $z \in S^{p-1}$. Note that

$$\langle h, z\rangle = \langle v, Xz\rangle.$$

Case 1. If $X$ is independent isotropic, then $Xz$ has i.i.d. centered sub-Gaussian entries with $\psi_2$ norm at most $\kappa$. By the earlier lemma on linear combinations of independent sub-Gaussian variables, we know that $\langle v, Xz\rangle$ is sub-Gaussian with $|||\langle v, Xz\rangle|||_{\psi_2} \le C\kappa$, where $C$ is an absolute constant. Hence $h$ is a sub-Gaussian random vector with $|||h|||_{\psi_2} \le C\kappa$. Using the expectation bound (110), we conclude that for any $v \in S^{n-1}$,

$$E\left[\sup_{R(u)\le 1} \langle X^T v, u\rangle\right] \le \eta_0 C\kappa \cdot w(\Omega_R). \tag{114}$$

Case 2. If $X$ is independent anisotropic, then $Xz$ has i.i.d. centered sub-Gaussian entries with $\psi_2$ norm at most $\kappa\sqrt{\Lambda_{\max}(\Sigma)}$, where $\Lambda_{\max}(\Sigma)$ is the largest eigenvalue of $\Sigma$. By the same argument as Case 1, we have

$$E\left[\sup_{R(u)\le 1} \langle X^T v, u\rangle\right] \le \eta_0 C\kappa \sqrt{\Lambda_{\max}(\Sigma)} \cdot w(\Omega_R). \tag{115}$$

Letting $\eta = \eta_0 C$ and combining (113), (114) and (115), we complete the proof.
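As a numerical illustration of the bound just proved, the following is a minimal sketch that assumes the $\ell_1$ norm for $R$ (so $R^*$ is the $\ell_\infty$ norm and $w(\Omega_R) = E\|g\|_\infty$) and standard Gaussian $X$ and $\omega$. The function names, dimensions and trial counts are hypothetical choices made only for this example. It compares a Monte Carlo estimate of $E[R^*(\tfrac{1}{n}X^T\omega)]$ with $w(\Omega_R)/\sqrt{n}$.

```python
import numpy as np

def gaussian_width_l1_ball(p, trials, rng):
    # For R = l1, sup_{||u||_1 <= 1} <g, u> = ||g||_inf, so w(Omega_R) = E||g||_inf.
    g = rng.standard_normal((trials, p))
    return np.abs(g).max(axis=1).mean()

def expected_dual_norm_of_gradient(n, p, trials, rng):
    # Monte Carlo estimate of E[R*(X^T omega / n)] with R* = l_inf.
    vals = np.empty(trials)
    for t in range(trials):
        X = rng.standard_normal((n, p))
        omega = rng.standard_normal(n)
        vals[t] = np.abs(X.T @ omega).max() / n
    return vals.mean()

rng = np.random.default_rng(0)
n, p = 100, 50
width = gaussian_width_l1_ball(p, 2000, rng)
grad_norm = expected_dual_norm_of_gradient(n, p, 500, rng)

# Ratio of the estimated expectation to w(Omega_R) / sqrt(n).
ratio = grad_norm / (width / np.sqrt(n))
print(f"w = {width:.3f}, E[R*(grad)] = {grad_norm:.4f}, ratio = {ratio:.3f}")
```

In this Gaussian setting the ratio stays of order one, in line with the $w(\Omega_R)/\sqrt{n}$ scaling; the sketch says nothing about the value of the constants in the theorem.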


C.2 Proof of Theorem m 

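To prove the theorem, we also need the following result from generic chaining.

Theorem 9 Let $\Omega_R = \{u : R(u) \le 1\}$ be the unit norm ball of $R(\cdot)$. Assuming $h$ is any centered sub-Gaussian random vector with $|||h|||_{\psi_2} \le \kappa$, we have for any $\tau > 0$,

$$P\left(\sup_{R(u)\le 1} \langle h, u\rangle > \eta_0 \kappa\, w(\Omega_R) + \tau\right) \le \eta_1 \exp\left(-\left(\frac{\tau}{\eta_2 \kappa \phi}\right)^2\right), \tag{116}$$

where $\eta_0$, $\eta_1$ and $\eta_2$ are universal constants, and $\phi = \sup_{R(u)\le 1} \|u\|_2$.

Proof: Consider the sub-Gaussian process $Y = \{Y_u\}$, $Y_u = \langle h, u\rangle$, indexed by $u \in \Omega_R$. By the same argument as in the proof of the expectation bound (110), we have

$$P\left(|Y_u - Y_v| > \epsilon\right) \le e \cdot \exp\left(-\frac{c\,\epsilon^2}{\kappa^2 \|u - v\|_2^2}\right) \tag{117}$$

and

$$E\left[\sup_{u,v} |Y_u - Y_v|\right] = 2E\left[\sup_{u} Y_u\right]. \tag{118}$$

Then a direct application of [37, Theorem 2.1.5] and [38, Theorem 2.2.27] gives us (116).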


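Theorem (restated) Let design $X$ and noise $\omega$ be sub-Gaussian, and $\mathcal{L}$ be squared loss. Define $\phi = \sup_{R(u)\le 1} \|u\|_2$. Then for any $\tau > 0$, with probability at least $1 - c_1 \exp\left(-\min\left(c_0 n, \frac{\tau^2}{c_2^2 \kappa^2 \phi^2}\right)\right)$, we have

$$R^*(\nabla\mathcal{L}(\theta^*; Z^n)) \le \xi \sqrt{\frac{2\kappa^2 + 1}{n}} \left(c\kappa \cdot w(\Omega_R) + \tau\right), \tag{119}$$

where $c$, $c_0$, $c_1$ and $c_2$ are all absolute constants, and $\xi$ is the same as in the preceding theorem.

Proof: We only show the case for isotropic $X$, where $\xi = 1$. Note that

$$
\begin{aligned}
&P\left(R^*(\nabla\mathcal{L}(\theta^*; Z^n)) > \sqrt{\frac{2\kappa^2 + 1}{n}} \left(c\kappa \cdot w(\Omega_R) + \tau\right)\right) \\
&\quad = P\left(\|\omega\|_2 \cdot R^*\left(X^T \frac{\omega}{\|\omega\|_2}\right) > \sqrt{(2\kappa^2 + 1)n}\, \left(c\kappa \cdot w(\Omega_R) + \tau\right)\right) \\
&\quad \le P\left(\|\omega\|_2 > \sqrt{(2\kappa^2 + 1)n}\right) + \sup_{v \in S^{n-1}} P\left(R^*(X^T v) > c\kappa \cdot w(\Omega_R) + \tau\right),
\end{aligned} \tag{120}
$$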


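where the last inequality uses the union bound. We first prove the bound for $\|\omega\|_2$. Since $\omega$ consists of i.i.d. centered unit-variance sub-Gaussian elements with $|||\omega_i|||_{\psi_2} \le \kappa$, $\omega_i^2$ is sub-exponential with $|||\omega_i^2|||_{\psi_1} \le 2\kappa^2$. By applying Bernstein's inequality to $\|\omega\|_2^2 = \sum_{i=1}^n \omega_i^2$, we obtain

$$P\left(\left|\|\omega\|_2^2 - E\left[\|\omega\|_2^2\right]\right| > \tau\right) \le 2\exp\left(-c_0 \min\left(\frac{\tau^2}{4\kappa^4 n}, \frac{\tau}{2\kappa^2}\right)\right),$$

where $c_0$ is an absolute constant. Setting $\tau = 2\kappa^2 n$ and using $E[\|\omega\|_2^2] = n$ from (113), we get

$$P\left(\|\omega\|_2 > \sqrt{(2\kappa^2 + 1)n}\right) \le 2\exp(-c_0 n). \tag{121}$$

Next we bound $R^*(X^T v)$ for any $v \in S^{n-1}$. Given any fixed $v \in S^{n-1}$, we note that

$$R^*(X^T v) = \sup_{R(u)\le 1} \langle X^T v, u\rangle,$$

and $X^T v$ is a sub-Gaussian random vector with $|||X^T v|||_{\psi_2} \le C\kappa$, as shown in the proof of the expectation bound in Section C.1. Using Theorem 9, we have

$$P\left(R^*(X^T v) > \eta_0 C\kappa \cdot w(\Omega_R) + \tau\right) \le \eta_1 \exp\left(-\left(\frac{\tau}{\eta_2 C\kappa \phi}\right)^2\right). \tag{122}$$

Letting $c = \eta_0 C$, $c_1 = \eta_1 + 2$ and $c_2 = \eta_2 C$, and combining (121) and (122), we complete the proof.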
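The two facts about the noise norm used above, the Jensen bound (113) and the concentration of $\|\omega\|_2^2$ around $n$ from the Bernstein step, can be checked by simulation. The following is a minimal sketch assuming standard Gaussian noise, which is one particular unit-variance sub-Gaussian choice. The sample sizes and the thresholds $0.5n$ and $2n$ are illustrative assumptions and are not the constants of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 20, 20000

# i.i.d. centered, unit-variance noise vectors (Gaussian is an assumption of this sketch).
omega = rng.standard_normal((trials, n))
norms = np.linalg.norm(omega, axis=1)

mean_norm = norms.mean()        # Monte Carlo estimate of E||omega||_2
jensen_bound = np.sqrt(n)       # right-hand side of (113)

# Empirical tail of | ||omega||_2^2 - n | at two illustrative thresholds.
sq_dev = np.abs(norms**2 - n)
tail_small = np.mean(sq_dev >= 0.5 * n)
tail_large = np.mean(sq_dev >= 2.0 * n)

print(f"E||omega|| ~ {mean_norm:.3f} <= sqrt(n) = {jensen_bound:.3f}")
print(f"tail at 0.5n: {tail_small:.4f}, tail at 2n: {tail_large:.6f}")
```

The empirical tail drops sharply between the two thresholds, which is the qualitative behaviour that (121) quantifies.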


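C.3 Proof of Lemma 3

Lemma 3 Let $\Omega_R = \{u : R(u) \le 1\}$ be the unit norm ball and $\Theta_R = \{u : R(u) = 1\}$ be its boundary. For any $\tilde\theta \in \Theta_R$, define $\rho(\tilde\theta) = \sup_{\theta : R(\theta)\le 1} \|\theta - \tilde\theta\|_2$, the diameter of $\Omega_R$ measured with respect to $\tilde\theta$. Let $G(\tilde\theta) = \mathrm{cone}(\Omega_R - \tilde\theta) \cap \rho(\tilde\theta) B_2^p$, i.e., the cone of $(\Omega_R - \tilde\theta)$ intersected with the ball of radius $\rho(\tilde\theta)$. Then

$$w(\Omega_R) \le \inf_{\tilde\theta \in \Theta_R} w(G(\tilde\theta)). \tag{123}$$

Proof: For any $\tilde\theta \in \Theta_R$, consider the set $F_R(\tilde\theta) = \Omega_R - \tilde\theta = \{u : R(u + \tilde\theta) \le 1\}$. Since Gaussian width is translation invariant, the Gaussian widths of $\Omega_R$ and $F_R(\tilde\theta)$ are the same, i.e., $w(\Omega_R) = w(F_R(\tilde\theta))$. Since $\rho(\tilde\theta) = \sup_{\theta : R(\theta)\le 1} \|\theta - \tilde\theta\|_2$ is the diameter of $\Omega_R$ as well as of $F_R(\tilde\theta)$, a ball of radius $\rho(\tilde\theta)$ will include $F_R(\tilde\theta)$, so that $F_R(\tilde\theta) \subseteq \rho(\tilde\theta) B_2^p$. Further, by definition, $F_R(\tilde\theta) \subseteq \mathrm{cone}(F_R(\tilde\theta)) = \mathrm{cone}(\Omega_R - \tilde\theta)$. Let $G(\tilde\theta) = \mathrm{cone}(\Omega_R - \tilde\theta) \cap \rho(\tilde\theta) B_2^p$. By construction, $F_R(\tilde\theta) \subseteq G(\tilde\theta)$. Then,

$$w(\Omega_R) = w(F_R(\tilde\theta)) \le w(G(\tilde\theta)).$$

Noting that the analysis holds for any $\tilde\theta \in \Theta_R$ completes the proof. ■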


D Restricted Eigenvalue Conditions: Sub-Gaussian Designs 

We focus on the restricted eigenvalue (RE) results from the main text. In particular, we consider RE conditions for sub-Gaussian design matrices for three different cases: (i) the design matrix has independent sub-Gaussian rows $X_i$ with $|||X_i|||_{\psi_2} \le \kappa$, (ii) the design matrix has independent rows with subGaussian elements $X_{ij}$ so that $|||X_{ij}|||_{\psi_2} \le \kappa$ and the columns are correlated, and (iii) the columns are independent subGaussian but the rows are correlated, i.e., correlated samples. One can view (ii) as a special case of (i), but we highlight this special case because of
its practical importance and past literature on RE conditions for anisotropic subGaussian designs 133] iMl- 

Our results simply use a general treatment developed in Il28]l . building on ||2T1, based on Talagrand’s generic 
chaining [37, 38]. We specifically focus on results in [28] which provide uniform bounds on the supremum of certain empirical processes. RE results for the specific cases of interest in the current paper will then be established by suitable choices of these empirical processes. The results in [28], and more generally in generic chaining [37, 38], are based on certain $\gamma$-functionals which we briefly introduce below.

Consider a metric space $(T, d)$ and for a finite set $A \subset T$, let $|A|$ denote its cardinality. An admissible sequence is an increasing sequence of subsets $\{A_n, n \ge 0\}$ of $T$, such that $|A_0| = 1$ and for $n \ge 1$, $|A_n| = 2^{2^n}$. Given $\alpha > 0$, we define the $\gamma_\alpha$-functional as

$$\gamma_\alpha(T, d) = \inf \sup_{t \in T} \sum_{n=0}^{\infty} 2^{n/\alpha}\, \mathrm{Diam}(A_n(t)), \tag{124}$$

where $A_n(t)$ is the unique element of $A_n$ that contains $t$, $\mathrm{Diam}(A_n(t))$ is the diameter of $A_n(t)$ according to $d$, and the infimum is over all admissible sequences of $T$. To get the desired RIP results in terms of Gaussian widths, we start with the following key result, originally [28, Theorem D].


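Theorem 10 (Mendelson, Pajor, Tomczak-Jaegermann [28]) There exist absolute constants $c_1, c_2, c_3$ for which the following holds. Let $(\Omega, \mu)$ be a probability space, set $F$ be a subset of the unit sphere of $L_2(\mu)$, i.e., $F \subseteq S_{L_2} = \{f : |||f|||_{L_2} = 1\}$, and assume that $\sup_{f \in F} |||f|||_{\psi_2} \le \kappa$. Then, for any $\theta > 0$ and $n \ge 1$ satisfying

$$c_1 \kappa\, \gamma_2(F, |||\cdot|||_{\psi_2}) \le \theta\sqrt{n}, \tag{125}$$

with probability at least $1 - \exp(-c_2 \theta^2 n / \kappa^4)$, for independent samples $X_1, \ldots, X_n$ drawn from $\mu$,

$$\sup_{f \in F} \left|\frac{1}{n}\sum_{i=1}^{n} f^2(X_i) - E[f^2]\right| \le \theta. \tag{126}$$

Further, if $F$ is symmetric, then

$$E\left[\sup_{f \in F} \left|\frac{1}{n}\sum_{i=1}^{n} f^2(X_i) - E[f^2]\right|\right] \le c_3 \max\left\{\kappa\, \frac{\gamma_2(F, |||\cdot|||_{\psi_2})}{\sqrt{n}},\; \frac{\gamma_2^2(F, |||\cdot|||_{\psi_2})}{n}\right\}. \tag{127}$$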


We use the above result and related arguments to establish RIP conditions for the cases of interest. 

D.1 Isotropic Sub-Gaussian Designs

We consider the case where the design matrix $X \in \mathbb{R}^{n \times p}$ has independent subGaussian rows where each row satisfies $|||X_i|||_{\psi_2} \le \kappa$ and $E[X_i X_i^T] = I_{p \times p}$. Thus, the measure $\mu$ from which the rows $X_i$ are sampled independently is an isotropic sub-Gaussian measure.

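Theorem 11 Let $X$ be a design matrix with independent isotropic subGaussian rows, i.e., $|||X_i|||_{\psi_2} \le \kappa$ and $E[X_i X_i^T] = I$. Then, for absolute constants $\eta, c > 0$, with probability at least $(1 - 2\exp(-\eta w^2(A)))$, we have

$$\sup_{u \in A} \left|\frac{1}{n}\|Xu\|_2^2 - 1\right| = \sup_{u \in A} \left|\frac{1}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2 - 1\right| \le c\,\frac{w(A)}{\sqrt{n}}, \tag{128}$$

or, equivalently,

$$1 - c\,\frac{w(A)}{\sqrt{n}} \le \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le 1 + c\,\frac{w(A)}{\sqrt{n}}. \tag{129}$$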

Proof: The result essentially follows from an application of Theorem 10. For convenience of notation, let $X_0$ be i.i.d. as the rows $X_i$, $i = 1, \ldots, n$, thus distributed following $\mu$. To apply Theorem 10, we choose any $A \subseteq S^{p-1}$ and consider the following class of functions: $F = \{\langle \cdot, u\rangle : u \in A\}$. Then, $f(X_0) = \langle X_0, u\rangle$ and $F$ is a subset of the unit sphere, i.e., $F \subseteq S_{L_2}$, since $|||f|||_{L_2}^2 = E[u^T X_0 X_0^T u] = \|u\|_2^2 = 1$. Further,

$$\sup_{f \in F} |||f|||_{\psi_2} = \sup_{u \in A} |||\langle X_0, u\rangle|||_{\psi_2} \le |||X_0|||_{\psi_2} \le \kappa.$$

Next, we show that for the current setting, the $\gamma_2$-functional can be upper bounded by $w(A)$, the Gaussian width of $A$. Since $\mu$ is isotropic subGaussian with $\psi_2$-norm bounded by $\kappa$, we have

$$\gamma_2(F \cap S_{L_2}, |||\cdot|||_{\psi_2}) \le \kappa\, \gamma_2(F \cap S_{L_2}, |||\cdot|||_{L_2}) \le \kappa c_4 w(A), \tag{130}$$

where the last inequality follows from generic chaining, in particular [37, Theorem 2.1.1], for an absolute constant $c_4 > 0$.

In the context of Theorem 10, we choose

$$\theta = c_1 c_4 \kappa^2\, \frac{w(A)}{\sqrt{n}} \ge c_1 \kappa\, \frac{\gamma_2(F \cap S_{L_2}, |||\cdot|||_{\psi_2})}{\sqrt{n}},$$

so that the condition on $\theta$ is satisfied. With this choice of $\theta$, we have

$$\theta^2 n / \kappa^4 = c_1^2 c_4^2\, w^2(A).$$

Then, from Theorem 10 it follows that with probability at least $1 - \exp(-\eta w^2(A))$, we have

$$\sup_{u \in A} \left|\frac{1}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2 - 1\right| \le c\,\frac{w(A)}{\sqrt{n}}, \tag{131}$$

where $\eta = c_2 c_1^2 c_4^2$ and $c = c_1 c_4 \kappa^2$ are constants. As a result, we have

$$\sup_{u \in A} \left(\frac{1}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2 - 1\right) \le c\,\frac{w(A)}{\sqrt{n}} \quad \text{and} \quad \sup_{u \in A} \left(1 - \frac{1}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2\right) \le c\,\frac{w(A)}{\sqrt{n}},$$

yielding

$$1 - c\,\frac{w(A)}{\sqrt{n}} \le \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le 1 + c\,\frac{w(A)}{\sqrt{n}}. \tag{132}$$

That completes the proof.
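The $1/\sqrt{n}$ decay of the deviation in (128) can be observed in simulation. The following is a minimal sketch under stated assumptions: the design is standard Gaussian (one isotropic sub-Gaussian choice), and the set $A$ is replaced by a finite sample of random $s$-sparse unit vectors, so the supremum is only over that sample. The helper names and all sizes are hypothetical.

```python
import numpy as np

def sample_sparse_unit_vectors(p, s, m, rng):
    # m random s-sparse unit vectors: a finite subset A of the sphere S^{p-1}.
    U = np.zeros((m, p))
    for j in range(m):
        support = rng.choice(p, size=s, replace=False)
        U[j, support] = rng.standard_normal(s)
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def re_deviation(n, U, rng):
    # max over u in A of |(1/n)||Xu||_2^2 - 1| for an isotropic Gaussian design X.
    X = rng.standard_normal((n, U.shape[1]))
    quad = np.sum((X @ U.T) ** 2, axis=0) / n
    return np.max(np.abs(quad - 1.0))

rng = np.random.default_rng(2)
U = sample_sparse_unit_vectors(p=50, s=3, m=200, rng=rng)
deviations = {n: re_deviation(n, U, rng) for n in (100, 400, 1600, 6400)}

for n, dev in deviations.items():
    # The last column rescales by sqrt(n); it should stay roughly constant.
    print(n, round(dev, 3), round(dev * np.sqrt(n), 2))
```

If the bound is tight up to constants, the rescaled column stays of the same order across $n$ while the raw deviation shrinks.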


D.2 Anisotropic Sub-Gaussian Designs 


We now consider the case where the design matrix $X \in \mathbb{R}^{n \times p}$ has independent rows, and each row $X_i$ is anisotropic subGaussian with $E[X_i^T X_i] = \Sigma$. Further, we assume that the corresponding isotropic random vector $\tilde{X}_i = X_i \Sigma^{-1/2}$ satisfies $|||\tilde{X}_i|||_{\psi_2} \le \kappa$. A simple special case of such an anisotropic subGaussian design is when $X_i \sim N(0, \Sigma)$, where $\tilde{X}_i = X_i \Sigma^{-1/2} \sim N(0, I)$ so that $|||\tilde{X}_i|||_{\psi_2} = 1$. The result below characterizes the RIP-style property of any such anisotropic subGaussian design.


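Theorem 12 Let $X$ be a design matrix with independent anisotropic subGaussian rows, i.e., $E[X_i^T X_i] = \Sigma$ and $|||X_i \Sigma^{-1/2}|||_{\psi_2} \le \kappa$. Then, for absolute constants $\eta, c > 0$, with probability at least $(1 - 2\exp(-\eta w^2(A)))$, we have

$$\sup_{u \in A} \left|\frac{1}{n}\frac{1}{u^T \Sigma u}\|Xu\|_2^2 - 1\right| = \sup_{u \in A} \left|\frac{1}{n}\frac{1}{u^T \Sigma u}\sum_{i=1}^{n} \langle X_i, u\rangle^2 - 1\right| \le c\,\frac{w(A)}{\sqrt{n}}. \tag{133}$$

Further,

$$\Lambda_{\min}(\Sigma|A)\left(1 - c\,\frac{w(A)}{\sqrt{n}}\right) \le \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \Lambda_{\max}(\Sigma|A)\left(1 + c\,\frac{w(A)}{\sqrt{n}}\right), \tag{134}$$

where

$$\Lambda_{\min}(\Sigma|A) = \inf_{u \in A} u^T \Sigma u, \quad \text{and} \quad \Lambda_{\max}(\Sigma|A) = \sup_{u \in A} u^T \Sigma u \tag{135}$$

are the restricted minimum and maximum eigenvalues of $\Sigma$ restricted to $A \subseteq S^{p-1}$.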


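Proof: The result also follows from an application of Theorem 10. For convenience of notation, let $X_0$ be i.i.d. as the rows $X_i$, $i = 1, \ldots, n$, thus distributed following $\mu$. To apply Theorem 10, we choose any $A \subseteq S^{p-1}$ and consider the following class of functions:

$$F = \left\{f_u, u \in A : f_u(\cdot) = \frac{1}{\sqrt{u^T \Sigma u}} \langle \cdot, u\rangle\right\}. \tag{136}$$

Then, $f_u(X_0) = \frac{1}{\sqrt{u^T \Sigma u}} \langle X_0, u\rangle$ and $F$ is a subset of the unit sphere, i.e., $F \subseteq S_{L_2}$, since for $f_u \in F$

$$|||f_u|||_{L_2}^2 = \frac{1}{u^T \Sigma u} E[u^T X_0^T X_0 u] = 1.$$

Next, we focus on getting an upper bound on $\sup_{f_u \in F} |||f_u|||_{\psi_2} = \sup_{u \in A} \left|\left|\left|\frac{1}{\sqrt{u^T \Sigma u}} \langle X_0, u\rangle\right|\right|\right|_{\psi_2}$. Let $\tilde{X}_0 = X_0 \Sigma^{-1/2}$ so that $\tilde{X}_0$ is an isotropic vector with $|||\tilde{X}_0|||_{\psi_2} \le \kappa$. Noting that

$$\langle X_0, u\rangle = \langle \tilde{X}_0, \Sigma^{1/2} u\rangle = \sqrt{u^T \Sigma u}\, \left\langle \tilde{X}_0, \frac{\Sigma^{1/2} u}{\|\Sigma^{1/2} u\|_2}\right\rangle,$$

we have

$$\sup_{u \in A} |||f_u|||_{\psi_2} = \sup_{u \in A} \left|\left|\left|\frac{1}{\sqrt{u^T \Sigma u}} \langle X_0, u\rangle\right|\right|\right|_{\psi_2} = \sup_{u \in A} \left|\left|\left|\left\langle \tilde{X}_0, \frac{\Sigma^{1/2} u}{\|\Sigma^{1/2} u\|_2}\right\rangle\right|\right|\right|_{\psi_2} \le |||\tilde{X}_0|||_{\psi_2} \le \kappa.$$

As a result, we have

$$\gamma_2(F \cap S_{L_2}, |||\cdot|||_{\psi_2}) \le \kappa\, \gamma_2(F \cap S_{L_2}, |||\cdot|||_{L_2}) \le \kappa c_4 w(A),$$

where the last inequality follows from [37, Theorem 2.1.1], for an absolute constant $c_4 > 0$.

In the context of Theorem 10, we choose

$$\theta = c_1 c_4 \kappa^2\, \frac{w(A)}{\sqrt{n}} \ge c_1 \kappa\, \frac{\gamma_2(F \cap S_{L_2}, |||\cdot|||_{\psi_2})}{\sqrt{n}}, \tag{137}$$

so that the lower bound condition on $\theta$ is satisfied. With this choice of $\theta$, we have

$$\theta^2 n / \kappa^4 = c_1^2 c_4^2\, w^2(A).$$

Then, from Theorem 10 it follows that with probability at least $1 - \exp(-\eta w^2(A))$, we have

$$\sup_{u \in A} \left|\frac{1}{n}\frac{1}{u^T \Sigma u}\sum_{i=1}^{n} \langle X_i, u\rangle^2 - 1\right| \le c\,\frac{w(A)}{\sqrt{n}}, \tag{138}$$

where $\eta = c_2 c_1^2 c_4^2$ and $c = c_1 c_4 \kappa^2$ are constants. As a result, we have

$$\sup_{u \in A} \left(\frac{1}{n}\frac{1}{u^T \Sigma u}\|Xu\|_2^2 - 1\right) \le c\,\frac{w(A)}{\sqrt{n}}, \tag{139}$$

$$\sup_{u \in A} \left(1 - \frac{1}{n}\frac{1}{u^T \Sigma u}\|Xu\|_2^2\right) \le c\,\frac{w(A)}{\sqrt{n}}. \tag{140}$$

From (139), we have

$$\frac{1}{\Lambda_{\max}(\Sigma|A)} \sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \sup_{u \in A} \frac{1}{n}\frac{1}{u^T \Sigma u}\|Xu\|_2^2 \le 1 + c\,\frac{w(A)}{\sqrt{n}},$$

so that

$$\sup_{u \in A} \frac{1}{n}\|Xu\|_2^2 \le \Lambda_{\max}(\Sigma|A) + c\,\Lambda_{\max}(\Sigma|A)\,\frac{w(A)}{\sqrt{n}}. \tag{141}$$

Similarly, from (140), we have

$$1 - c\,\frac{w(A)}{\sqrt{n}} \le \inf_{u \in A} \frac{1}{n}\frac{1}{u^T \Sigma u}\|Xu\|_2^2 \le \frac{1}{\Lambda_{\min}(\Sigma|A)} \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2,$$

implying

$$\Lambda_{\min}(\Sigma|A) - c\,\Lambda_{\min}(\Sigma|A)\,\frac{w(A)}{\sqrt{n}} \le \inf_{u \in A} \frac{1}{n}\|Xu\|_2^2. \tag{142}$$

Putting (141) and (142) together completes the proof.

E Generalized Linear Models: Restricted Strong Convexity

We establish bounds on the regularization parameter and the RSC condition for GLMs as discussed in the main text, along with a few specific examples.

E.1 Generalized Linear Models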

Loss functions for GLMs are derived as maximum likelihood estimators for the family of exponential distri¬ 
butions. The canonical density function of exponential family distributions is given by HI l8l l44]l : 

$$P(y|\eta) \propto \exp\{\eta y - \psi(\eta)\}, \tag{143}$$

where $\eta$ is the natural parameter, which has a one-to-one mapping with the mean parameter $\mu = E[y]$ of the distribution, and $\psi(\eta)$ is the log-partition function, which ensures that $P(y|\eta)$ remains a probability distribution. The gradient of the log-partition function is the response function, i.e., $g(\cdot) = \psi'(\cdot)$, which is monotonic by construction. The inverse of the response function is the so-called link function $h(\cdot) = g^{-1}(\cdot)$.


The mean of the distribution can be obtained from the gradient of the log-partition function at the natural parameter, i.e.,

$$\mu = \psi'(\eta) = g(\eta). \tag{144}$$

The interested reader can study details and additional properties of exponential families from the existing lit¬ 
erature im m @31 . Examples of distributions from the exponential family include the Gaussian, multinomial, 
exponential, Dirichlet, Poisson, Gamma, etc. 

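GLMs are obtained from conditional exponential family distributions by assuming a suitable parametric form of the natural parameter $\eta$ in terms of $X$ and $\theta^*$, in particular $\eta_i = \langle X_i, \theta^*\rangle$. Then, the conditional distribution is given by

$$P(y_i|X_i, \theta^*) \propto \exp\{\eta_i y_i - \psi(\eta_i)\} = \exp\{\langle X_i, \theta^*\rangle y_i - \psi(\langle X_i, \theta^*\rangle)\}. \tag{145}$$

The loss function for GLMs is simply the negative log-likelihood of such conditional exponential family forms. Assuming samples to be independent, we have

$$\mathcal{L}(\theta; Z^n) = -\frac{1}{n}\sum_{i=1}^{n} \log P(y_i|X_i, \theta) = -\frac{1}{n}\sum_{i=1}^{n} \left\{y_i \langle X_i, \theta\rangle - \psi(\langle X_i, \theta\rangle)\right\}. \tag{146}$$

Using the chain rule, the first derivative of the loss function evaluated at $\theta^*$ is

$$\nabla\mathcal{L}(\theta^*; Z^n) = -\frac{1}{n}\sum_{i=1}^{n} y_i X_i + \frac{1}{n}\sum_{i=1}^{n} X_i \frac{\partial \psi(\eta_i)}{\partial \eta_i} = \frac{1}{n}\sum_{i=1}^{n} X_i \left(E(y|X_i) - y_i\right) = \frac{1}{n} X^T \omega,$$

where each element of $\omega \in \mathbb{R}^n$ is given as $\omega_i = E(y|X_i) - y_i$. Next we look at some specific examples of exponential families and corresponding GLMs.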
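The gradient formula $\nabla\mathcal{L}(\theta^*; Z^n) = \frac{1}{n}X^T\omega$ can be checked against finite differences. The following is a minimal sketch that assumes the Bernoulli (logistic) log-partition function $\psi(\eta) = \log(1 + e^\eta)$ as the illustrative family and a standard Gaussian design; the function names and problem sizes are hypothetical.

```python
import numpy as np

def log_partition(eta):
    # Bernoulli (logistic) log-partition function: psi(eta) = log(1 + exp(eta)).
    return np.logaddexp(0.0, eta)

def mean_function(eta):
    # psi'(eta), i.e., the conditional mean E[y | X_i].
    return 1.0 / (1.0 + np.exp(-eta))

def glm_loss(theta, X, y):
    # Negative log-likelihood (146): (1/n) sum_i { psi(<X_i, theta>) - y_i <X_i, theta> }.
    eta = X @ theta
    return np.mean(log_partition(eta) - y * eta)

def glm_gradient(theta, X, y):
    # (1/n) X^T omega with omega_i = E[y | X_i] - y_i.
    omega = mean_function(X @ theta) - y
    return X.T @ omega / X.shape[0]

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
y = (rng.random(n) < mean_function(X @ theta_star)).astype(float)

grad = glm_gradient(theta_star, X, y)

# Central finite differences along each coordinate direction.
eps = 1e-6
fd = np.array([
    (glm_loss(theta_star + eps * e, X, y) - glm_loss(theta_star - eps * e, X, y)) / (2 * eps)
    for e in np.eye(p)
])
max_err = np.max(np.abs(grad - fd))
print(f"max |analytic - finite difference| = {max_err:.2e}")
```

Swapping `log_partition` and `mean_function` for another family's $\psi$ and $\psi'$ would exercise the same formula.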

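1. Gaussian distribution: If the variance of the Gaussian distribution $P(y|X_i)$ is assumed to be 1, then we have

$$P(y_i|X_i, \theta^*) \propto \exp\left(y_i \langle X_i, \theta^*\rangle - \frac{\langle X_i, \theta^*\rangle^2}{2}\right).$$

Comparing it with the canonical form given earlier, the natural parameter is $\eta_i = \langle X_i, \theta^*\rangle$, the log-partition function is $\psi(\langle X_i, \theta^*\rangle) = \frac{\langle X_i, \theta^*\rangle^2}{2}$, and hence $\mu_i = \psi'(\langle X_i, \theta^*\rangle) = \langle X_i, \theta^*\rangle$. The noise $\omega_i = y_i - E(y_i|X_i)$ is Gaussian. Considering the negative log-likelihood, the GLM corresponding to the Gaussian distribution yields least squares regression [8].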

2. Bernoulli distribution: Assuming the conditional distribution of $y_i|X_i, \theta^*$ to be Bernoulli with conditional mean parameter $p_i$, which is a suitable function of $\langle X_i, \theta^*\rangle$, the likelihood of the observations is given by

$$P(y_i|p_i) = p_i^{y_i}(1 - p_i)^{1 - y_i} = \exp\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right) = \exp\left(y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i)\right).$$

Therefore, the natural parameter is $\eta_i = \langle X_i, \theta^*\rangle = \log\frac{p_i}{1 - p_i}$, giving $p_i = \frac{\exp(\langle X_i, \theta^*\rangle)}{1 + \exp(\langle X_i, \theta^*\rangle)}$. It can be verified, from the fact that the probability mass function $P(y_i|p_i)$ adds to 1, that the log-partition function evaluates to $\psi(\eta_i) = -\log(1 - p_i) = \log(1 + \exp(\langle X_i, \theta^*\rangle))$. The noise in the model corresponds to random draws from a Bernoulli distribution, and each element of $\omega$ is $\omega_i = p_i - y_i = \frac{\exp(\langle X_i, \theta^*\rangle)}{1 + \exp(\langle X_i, \theta^*\rangle)} - y_i$, which is bounded and hence sub-Gaussian. Considering the negative log-likelihood, the GLM corresponding to the Bernoulli distribution yields logistic regression [8].

3. Poisson distribution: Assuming the conditional distribution of $y_i|X_i, \theta^*$ to be Poisson with conditional mean parameter $\lambda_i$, which is a suitable function of $\langle X_i, \theta^*\rangle$, the likelihood of the observations is given by

$$P(y_i|\lambda_i) = \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i!} \propto \exp\{\log(\lambda_i^{y_i}\exp(-\lambda_i))\} = \exp\{y_i \log\lambda_i - \lambda_i\},$$

where the $1/y_i!$ term constitutes the base measure for the distribution. Based on the form, the natural parameter is $\eta_i = \langle X_i, \theta^*\rangle = \log\lambda_i$, giving $\lambda_i = \exp(\eta_i) = \exp(\langle X_i, \theta^*\rangle)$. Also, it can be verified that the log-partition function is $\psi(\eta_i) = \exp(\eta_i) = \lambda_i$. Each element of $\omega$ is $\omega_i = \lambda_i - y_i = \exp(\langle X_i, \theta^*\rangle) - y_i$. Considering the negative log-likelihood, the GLM corresponding to the Poisson distribution yields Poisson regression [8].
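The three examples above share one pattern: the mean is the derivative of the log-partition function, as in (144). The following is a minimal sketch that tabulates $\psi$ and $\psi'$ for the Gaussian, Bernoulli and Poisson cases and compares $\psi'$ with a numerical derivative. The dictionary layout, helper name and grid are hypothetical choices for illustration.

```python
import numpy as np

# (log-partition psi, mean function psi') for the three examples above.
GLM_FAMILIES = {
    "gaussian": (lambda eta: 0.5 * eta**2, lambda eta: eta),
    "bernoulli": (lambda eta: np.logaddexp(0.0, eta), lambda eta: 1.0 / (1.0 + np.exp(-eta))),
    "poisson": (lambda eta: np.exp(eta), lambda eta: np.exp(eta)),
}

def numerical_derivative(f, eta, h=1e-5):
    # Central difference approximation of f'(eta).
    return (f(eta + h) - f(eta - h)) / (2.0 * h)

eta_grid = np.linspace(-2.0, 2.0, 9)
max_gap = {
    name: float(np.max(np.abs(numerical_derivative(psi, eta_grid) - mean(eta_grid))))
    for name, (psi, mean) in GLM_FAMILIES.items()
}

for name, gap in max_gap.items():
    print(f"{name}: max |psi' - numerical derivative| = {gap:.2e}")
```

The gaps are at the level of finite-difference error, which is consistent with $\mu = \psi'(\eta)$ in each case.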

As discussed in the main text, if the design matrix $X$ and the noise are assumed to be sub-Gaussian, then the regularization parameter $\lambda_n$ needs to be $O\left(\frac{w(\Omega_R)}{\sqrt{n}}\right)$, following the earlier analysis of the regularization parameter. In the rest of this section, we focus on proving that GLMs with sub-Gaussian designs satisfy the RSC condition with sample complexity depending on the width of the spherical cap corresponding to the error set, as discussed in the main text.

E.2 RSC condition for GLMs 

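For any convex loss function to satisfy the RSC condition on any $A \subseteq S^{p-1}$, the following inequality

$$\delta\mathcal{L}(\theta^*, u; Z^n) = \mathcal{L}(\theta^* + u; Z^n) - \mathcal{L}(\theta^*; Z^n) - \langle \nabla\mathcal{L}(\theta^*; Z^n), u\rangle \ge \kappa\|u\|_2^2 \tag{147}$$

needs to hold $\forall u \in A$. For the general formulation of GLM discussed earlier, we have

$$
\begin{aligned}
\delta\mathcal{L}(\theta^*, u; Z^n) = {} & -\left\langle \theta^* + u, \frac{1}{n}\sum_{i=1}^{n} y_i X_i\right\rangle + \frac{1}{n}\sum_{i=1}^{n} \psi(\langle \theta^* + u, X_i\rangle) + \left\langle \theta^*, \frac{1}{n}\sum_{i=1}^{n} y_i X_i\right\rangle \\
& - \frac{1}{n}\sum_{i=1}^{n} \psi(\langle \theta^*, X_i\rangle) + \left\langle u, \frac{1}{n}\sum_{i=1}^{n} y_i X_i\right\rangle - \frac{1}{n}\sum_{i=1}^{n} \psi'(\langle \theta^*, X_i\rangle)\langle u, X_i\rangle.
\end{aligned}
$$

Simplifying the expression and applying the mean value theorem twice, we get the following:

$$\delta\mathcal{L}(\theta^*, u; Z^n) = \frac{1}{n}\sum_{i=1}^{n} \psi''\left(\langle \theta^*, X_i\rangle + \gamma_i \langle u, X_i\rangle\right) \langle u, X_i\rangle^2, \tag{148}$$

for suitable $\gamma_i \in [0, 1]$. The RSC condition for GLMs then needs to consider lower bounds for

$$\delta\mathcal{L}(\theta^*, u; Z^n) = \frac{1}{n}\sum_{i=1}^{n} \psi''\left(\langle \theta^*, X_i\rangle + \gamma_i \langle u, X_i\rangle\right) \langle u, X_i\rangle^2, \tag{149}$$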

where $\gamma_i \in [0, 1]$. The second derivative of the log-partition function is always positive. Since the RSC condition relies on a non-trivial lower bound for the above quantity, the analysis will suitably consider a compact set where $\ell = \ell_\psi(T) = \min_{|a| \le 2T} \psi''(a)$ is bounded away from zero. The only assumption outside this compact set $\{a : |a| \le 2T\}$ is that the second derivative is greater than 0. Further, we assume $\|\theta^*\|_2 \le c_1$ for some constant $c_1$. With these assumptions,

$$\delta\mathcal{L}(\theta^*, u; Z^n) \ge \frac{\ell}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2\, \mathbb{I}\left[|\langle X_i, \theta^*\rangle| \le T\right] \mathbb{I}\left[|\langle X_i, u\rangle| \le T\right]. \tag{150}$$

We give a characterization of the RSC condition for isotropic sub-Gaussian design matrices $X \in \mathbb{R}^{n \times p}$. We consider $u \in A \subseteq S^{p-1}$ so that $\|u\|_2 = 1$. Assuming $X$ has sub-Gaussian rows with $|||X_i|||_{\psi_2} \le \kappa$, $\langle X_i, u\rangle$ and $\langle X_i, \theta^*\rangle$ are sub-Gaussian random variables with sub-Gaussian norm at most $C\kappa$.

Let $\epsilon_1$ and $\epsilon_2$ denote bounds on the tail probabilities that $\langle X_i, u\rangle$ and $\langle X_i, \theta^*\rangle$ exceed some constant $T$, i.e., $\epsilon_1(T; u) = P\{|\langle X_i, u\rangle| > T\} \le e \cdot \exp(-c_2 T^2 / C^2\kappa^2) = \epsilon_1$, and $\epsilon_2(T; \theta^*) = P\{|\langle X_i, \theta^*\rangle| > T\} \le e \cdot \exp(-c_2 T^2 / C^2\kappa^2) = \epsilon_2$, where $\epsilon_1 = \epsilon_1(T, \kappa)$ and $\epsilon_2 = \epsilon_2(T, \kappa)$ are uniform upper bounds on the individual tail probabilities. The result we present below is in terms of the above defined constants $\ell = \ell_\psi(T)$, $\epsilon_1 = \epsilon_1(T, \kappa)$ and $\epsilon_2 = \epsilon_2(T, \kappa)$ for any suitably chosen $T$.
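The quantities in (147) and (150) can be computed directly for a concrete GLM. The following is a minimal sketch assuming the logistic log-partition function, a standard Gaussian design, and the illustrative choice $T = 2$; all names and sizes are hypothetical. Note one deliberate assumption: the check uses the conservative constant $\ell/2$ that comes from the second-order Taylor remainder, so that it holds regardless of how constants are absorbed in (148) to (150).

```python
import numpy as np

def psi(eta):
    # Logistic log-partition function.
    return np.logaddexp(0.0, eta)

def psi_prime(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def psi_second(eta):
    m = psi_prime(eta)
    return m * (1.0 - m)

def loss(theta, X, y):
    eta = X @ theta
    return np.mean(psi(eta) - y * eta)

def taylor_remainder(theta, u, X, y):
    # delta L(theta, u) = L(theta + u) - L(theta) - <grad L(theta), u>, as in (147).
    grad = X.T @ (psi_prime(X @ theta) - y) / X.shape[0]
    return loss(theta + u, X, y) - loss(theta, X, y) - grad @ u

rng = np.random.default_rng(4)
n, p, T = 500, 10, 2.0
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
theta_star /= np.linalg.norm(theta_star)
y = (rng.random(n) < psi_prime(X @ theta_star)).astype(float)
u = rng.standard_normal(p)
u /= np.linalg.norm(u)

a, b = X @ theta_star, X @ u
keep = (np.abs(a) <= T) & (np.abs(b) <= T)          # the two indicators in (150)
truncated_quadratic = np.mean(b**2 * keep)

# For the logistic case psi'' is even and decreasing in |a|, so the minimum over |a| <= 2T is at 2T.
ell = psi_second(2.0 * T)

delta = taylor_remainder(theta_star, u, X, y)
lower = 0.5 * ell * truncated_quadratic
print(f"delta L = {delta:.4f}, (ell/2) * truncated quadratic = {lower:.4f}")
```

Larger $T$ keeps more samples inside the indicators but shrinks $\ell$, which is the trade-off behind the phrase "any suitably chosen $T$".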

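Theorem Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with independent isotropic sub-Gaussian rows such that $|||X_i|||_{\psi_2} \le \kappa$. Then, for any set $A \subseteq S^{p-1}$, for suitable constants $\eta, c > 0$, with probability at least $1 - 2\exp(-\eta w^2(A))$, we have

$$\inf_{u \in A} \partial\mathcal{L}(\theta^*; u, X) \ge \ell\rho^2\left(1 - c\kappa_1^2\, \frac{w(A)}{\sqrt{n}}\right), \tag{151}$$

where $\rho^2 = \inf_{u \in A} \rho_u^2$ with $\rho_u^2 = E\left[\langle X_i, u\rangle^2\, \mathbb{I}[|\langle X_i, \theta^*\rangle| \le T]\, \mathbb{I}[|\langle X_i, u\rangle| \le T]\right]$, and $\kappa_1$ is the bound on the sub-Gaussian norm of the truncated variables defined in the proof.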

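Proof: For any fixed $T$, let $Z_i = Z_i^u = \langle X_i, u\rangle\, \mathbb{I}(|\langle X_i, u\rangle| \le T)\, \mathbb{I}(|\langle X_i, \theta^*\rangle| \le T)$. Then, the probability distribution over $Z_i$ can be written as¹

$$P(Z_i = z) = \frac{P(\langle X_i, u\rangle = z)\, \mathbb{I}(|\langle X_i, u\rangle| \le T)\, \mathbb{I}(|\langle X_i, \theta^*\rangle| \le T)}{P(|\langle X_i, u\rangle| \le T, |\langle X_i, \theta^*\rangle| \le T)} \le \frac{1}{1 - \epsilon_1 - \epsilon_2}\, P(\langle X_i, u\rangle = z). \tag{152}$$

As a result, $|||Z_i|||_{\psi_2} \le \kappa_1$ for a constant $\kappa_1$ determined by $C\kappa$, $\epsilon_1$ and $\epsilon_2$. Thus, $Z_i = Z_i^u$ is a sub-Gaussian random variable for any $u \in A$.

Let $\rho_u^2 = E[(Z_i^u)^2] > 0$. Let $X_0$ be i.i.d. as the rows $X_i$, $i = 1, \ldots, n$. Let $A \subseteq S^{p-1}$ and consider the following class of functions: $F = \left\{\frac{1}{\rho_u}\langle \cdot, u\rangle\, \mathbb{I}(|\langle \cdot, u\rangle| \le T)\, \mathbb{I}(|\langle \cdot, \theta^*\rangle| \le T) : u \in A\right\}$. Then for any $f \in F$, $f(X_0) = \frac{1}{\rho_u}\langle X_0, u\rangle\, \mathbb{I}(|\langle X_0, u\rangle| \le T)\, \mathbb{I}(|\langle X_0, \theta^*\rangle| \le T)$ and, by construction, $F$ is a subset of the unit sphere, i.e., $F \subseteq S_{L_2}$. Further, $\sup_{f \in F} |||f|||_{\psi_2} \le \kappa_1/\rho$.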

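Next, we show that for the current setting, the $\gamma_2$-functional can be upper bounded by $w(A)$, the Gaussian width of $A$. Since the process is sub-Gaussian with $\psi_2$-norm bounded by $\kappa_1$, we have

$$\gamma_2(F \cap S_{L_2}, |||\cdot|||_{\psi_2}) \le \kappa_1\, \gamma_2(F \cap S_{L_2}, |||\cdot|||_{L_2}) \le \kappa_1 c_4 w(A), \tag{153}$$

where the last inequality follows from generic chaining, in particular [37, Theorem 2.1.1], for an absolute constant $c_4 > 0$.

In the context of Theorem 10, we choose

$$\theta = c_1 c_4 \kappa_1^2\, \frac{w(A)}{\sqrt{n}} \ge c_1 \kappa_1\, \frac{\gamma_2(F \cap S_{L_2}, |||\cdot|||_{\psi_2})}{\sqrt{n}}, \tag{154}$$

so that the condition on $\theta$ is satisfied. With this choice of $\theta$, we have

$$\theta^2 n / \kappa_1^4 = c_1^2 c_4^2\, w^2(A). \tag{155}$$

Then, from Theorem 10 it follows that with probability at least $1 - \exp(-\eta w^2(A))$, we have

$$\sup_{u \in A} \left|\frac{1}{\rho_u^2 n}\sum_{i=1}^{n} \langle X_i, u\rangle^2\, \mathbb{I}(|\langle X_i, u\rangle| \le T)\, \mathbb{I}(|\langle X_i, \theta^*\rangle| \le T) - 1\right| \le c\kappa_1^2\, \frac{w(A)}{\sqrt{n}}, \tag{156}$$

where $\eta = c_2 c_1^2 c_4^2$ and $c = c_1 c_4$ are absolute constants. Thus, with probability at least $1 - \exp(-\eta w^2(A))$,

$$\inf_{u \in A} \frac{1}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2\, \mathbb{I}(|\langle X_i, u\rangle| \le T)\, \mathbb{I}(|\langle X_i, \theta^*\rangle| \le T) \ge \inf_{u \in A} \rho_u^2\left(1 - c\kappa_1^2\, \frac{w(A)}{\sqrt{n}}\right). \tag{157}$$

Denoting $\rho^2 = \inf_{u \in A} \rho_u^2$, with probability at least $1 - \exp(-\eta w^2(A))$, we have

$$\inf_{u \in A} \partial\mathcal{L}(\theta^*; u, X) \ge \inf_{u \in A} \frac{\ell}{n}\sum_{i=1}^{n} \langle X_i, u\rangle^2\, \mathbb{I}[|\langle X_i, \theta^*\rangle| \le T]\, \mathbb{I}[|\langle X_i, u\rangle| \le T] \ge \ell\rho^2\left(1 - c\kappa_1^2\, \frac{w(A)}{\sqrt{n}}\right). \tag{158}$$

That completes the proof. ■

¹With abuse of notation, we treat the distribution over $Z_i$ as discrete for ease of notation. A similar argument applies for the true continuous distribution, but more notation is needed.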